DESIGN OF POLYKETIDE SYNTHASE GENES

FIELD OF INVENTION 

The present invention provides methods for the analysis of polyketides and the design of
polyketide synthase genes. The invention relates to the fields of computational analysis,
chemistry, molecular biology, and medicine.


BACKGROUND OF THE INVENTION

The class of compounds known as polyketides is a large family of diverse compounds
synthesized primarily from 2-carbon unit building block compounds through a series of
condensations and subsequent modifications. Polyketides occur in many types of organisms,
including fungi and mycelial bacteria such as the actinomycetes. There are a wide variety of
polyketide structures, and the class of polyketides encompasses numerous compounds with
diverse activities. Epothilone, erythromycin, FK-506, FK-520, megalomicin, narbomycin,
oleandomycin, picromycin, rapamycin, spinocyn, and tylosin are examples of such compounds.

Given the difficulty in producing polyketide compounds by traditional chemical
methodology, and the typically low production of polyketides in wild type cells, there has been
considerable interest in finding improved or alternate means to produce polyketide compounds.
See PCT Publication Nos. WO 95/08548; WO 96/40968; WO 97/02358; and WO 98/27203; United
States Patent Nos. 5,962,290; 5,672,491; and 5,712,146; Fu et al., Biochemistry 33: 9321-9326
(1994); McDaniel et al., Science 262: 1546-1555 (1993); and Rohr, Angew. Chem. Int. Ed. Engl.
34(8): 881-888 (1995), each of which is incorporated herein by reference.

Polyketides are synthesized in nature by polyketide synthase (PKS) enzymes. These
enzymes, which are complexes of multiple large proteins, are similar to the synthases that
catalyze condensation of 2-carbon unit building block compounds in the biosynthesis of fatty
acids. The genes that encode PKS enzymes usually consist of three or more open reading
frames (ORFs). Two major types of PKS enzymes are known that differ in their composition
and mode of synthesis. These two major types of PKS enzymes are commonly referred to as
Type I or "modular" and Type II or "iterative" PKS enzymes.

Modular PKSs produce many different polyketides, including a large number of 12-, 14-,
and 16-membered macrolide antibiotics including erythromycin, megalomicin, methymycin,
narbomycin, oleandomycin, picromycin, and tylosin. Each ORF of a modular PKS can
comprise one, two, or more "modules" of ketosynthase activity, each module of which consists
of at least two (if a loading module) and more typically three (for the simplest extender module)
or more enzymatic activities or "domains." These large multifunctional enzymes (>300,000
kDa) catalyze the biosynthesis of polyketide macrolactones through multistep pathways
involving decarboxylative condensations between acyl thioesters followed by cycles of varying
β-carbon processing activities (see O'Hagan, D., The polyketide metabolites, E. Horwood, New
York, 1991, which is incorporated herein by reference).

During the past half decade, the study of modular PKS function and specificity has been
greatly facilitated by the plasmid-based Streptomyces coelicolor expression system developed
with the 6-deoxyerythronolide B (6-dEB) synthase (DEBS) genes (see Kao et al., Science, 265:
509-512 (1994), McDaniel et al., Science 262: 1546-1557 (1993), and U.S. Patent Nos.
5,672,491 and 5,712,146, each of which is incorporated herein by reference). The advantages to
this plasmid-based genetic system for DEBS are that it overcomes the tedious and limited
techniques for manipulating the natural DEBS host organism, Saccharopolyspora erythraea,
allows more facile construction of recombinant PKSs, and reduces the complexity of PKS
analysis by providing a "clean" host background. This system also expedited construction of a
combinatorial modular polyketide library in Streptomyces (see PCT publication No. WO
98/49315, incorporated herein by reference).

The ability to control aspects of polyketide biosynthesis, such as monomer selection and
degree of β-carbon processing, by genetic manipulation of PKSs has stimulated great interest in
the combinatorial engineering of novel antibiotics (see Hutchinson, Curr. Opin. Microbiol. 1:
319-329 (1998); Carreras and Santi, Curr. Opin. Biotech. 9: 403-411 (1998); and U.S. Patent
Nos. 5,962,290; 5,712,146; and 5,672,491, each of which is incorporated herein by reference).
This interest has resulted in the cloning, analysis, and manipulation by recombinant DNA
technology of genes that encode PKS enzymes. The resulting technology allows one to
manipulate a known PKS gene cluster either to produce the polyketide synthesized by that PKS
at higher levels than occur in nature or in hosts that otherwise do not produce the polyketide.
The technology also allows one to produce molecules that are structurally related to, but distinct
from, the polyketides produced from known PKS gene clusters.

Polyketides are assembled by polyketide synthases through successive condensations of
activated coenzyme-A thioester monomers derived from small organic acids such as acetate,
propionate, and butyrate. Active sites required for condensation include an acyltransferase
(AT), acyl carrier protein (ACP), and beta-ketoacylsynthase (KS). Each condensation cycle
results in a β-keto group that undergoes all, some, or none of a series of processing activities.
Active sites that perform these reactions include a ketoreductase (KR), dehydratase (DH), and
enoylreductase (ER). Thus, the absence of any beta-keto processing domain results in the
presence of a ketone, a KR alone gives rise to a hydroxyl, a KR and DH result in an alkene,
while a KR, DH, and ER combination leads to complete reduction to an alkane. After assembly
of the polyketide chain, the molecule typically undergoes cyclization(s) and post-PKS
modification (e.g., glycosylation, oxidation, acylation) to achieve the final active compound.

To illustrate the synthesis of a macrolide by a modular PKS (see Cane et al., Science
282: 63 (1998), incorporated herein by reference), one can refer to the PKS that produces the
erythromycin polyketide (6-deoxyerythronolide B synthase or DEBS; see U.S. Patent No.
5,824,513, incorporated herein by reference). In the modular DEBS PKS enzyme, the enzymatic
steps for each round of condensation and reduction are encoded within a single "module" of the
polypeptide (i.e., one distinct module for every condensation cycle). As shown in Figure 1,
DEBS consists of a loading module and 6 extender modules and a chain terminating thioesterase
(TE) domain within three extremely large polypeptides encoded by three open reading frames
(ORFs, designated eryAI, eryAII, and eryAIII).

Each of the three polypeptide subunits of DEBS (DEBS1, DEBS2, and DEBS3 in
Figure 1) contains 2 extender modules. DEBS1 additionally contains the loading module, and
DEBS3 contains the TE domain. Collectively, these proteins catalyze the condensation and
appropriate reduction of one propionyl CoA starter unit and six methylmalonyl CoA extender
units. Modules 1, 2, 5, and 6 contain KR domains; module 4 contains a complete set,
KR/DH/ER, of reductive and dehydratase domains; and module 3 contains no functional
reductive domain. Following the condensation and appropriate dehydration and reduction
reactions, the enzyme bound intermediate is lactonized by the TE at the end of extender module
6 to form 6-dEB (compound 1 in Figure 1).
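
The relationship between the reductive domains in a module and the resulting beta-carbon
functionality can be written as a small lookup. The following Python sketch is illustrative
only: the function and variable names are assumptions made for this example, and the module
contents are simply those stated above for DEBS.

```python
# Rule stated in the text: the reductive domains present in a module determine
# the functional group left at the beta carbon.
def beta_carbon_outcome(domains):
    reductive = frozenset(domains) & {"KR", "DH", "ER"}
    outcomes = {
        frozenset(): "ketone",
        frozenset({"KR"}): "hydroxyl",
        frozenset({"KR", "DH"}): "alkene",
        frozenset({"KR", "DH", "ER"}): "alkane",
    }
    # Domain combinations that the text does not describe are reported as such.
    return outcomes.get(reductive, "not described")


# Reductive domains of the six DEBS extender modules, as listed in the text.
DEBS_EXTENDER_MODULES = {
    1: {"KR"},
    2: {"KR"},
    3: set(),                 # no functional reductive domain
    4: {"KR", "DH", "ER"},    # complete set
    5: {"KR"},
    6: {"KR"},
}

debs_outcomes = {
    module: beta_carbon_outcome(domains)
    for module, domains in DEBS_EXTENDER_MODULES.items()
}

for module, outcome in debs_outcomes.items():
    print(f"extender module {module}: {outcome}")
```

Non-reductive domains such as KS, AT, and ACP may be passed in as well; the lookup ignores them.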



More particularly, the loading module of DEBS consists of two domains, an
acyltransferase (AT) domain and an acyl carrier protein (ACP) domain. In other PKS enzymes, the
loading module is not composed of an AT and an ACP but instead utilizes a partially inactivated
KS, an AT, and an ACP. This partially inactivated KS is in most instances called KS^Q, where
the superscript letter is the abbreviation for the amino acid, glutamine, that is present instead of a
cysteine in the active site that is believed to be required for condensation activity. Although the
KS^Q domain lacks condensation activity, it retains decarboxylase activity. The AT domain of
the loading module recognizes a particular acyl-CoA (propionyl for DEBS, which can also
accept acetyl) and transfers it as a thiol ester to the ACP of the loading module. Concurrently,
the AT on each of the extender modules recognizes a particular extender-CoA (methylmalonyl
for DEBS) and transfers it to the ACP of that module to form a thioester. Once the PKS is
primed with acyl- and malonyl-ACPs, the acyl group of the loading module migrates to form a
thiol ester (trans-esterification) at the KS of the first extender module; at this stage, extender
module 1 possesses an acyl-KS and a methylmalonyl ACP. The acyl group derived from the
loading module is then covalently attached to the alpha-carbon of the malonyl group to form a
carbon-carbon bond, driven by concomitant decarboxylation, and generating a new acyl-ACP
that has a backbone two carbons longer than the loading unit (elongation or extension). The
growing polyketide chain (various intermediates are shown in Figure 1) is transferred from the
ACP to the KS of the next module, and the process continues.

The polyketide chain, growing by two carbons each module, is sequentially passed as a
covalently bound thiol ester from module to module, in an assembly line-like process. The
carbon chain produced by this process alone would possess a ketone at every other carbon atom, 
producing a polyketone, from which the name polyketide arises. Commonly, however, 
additional enzymatic activities modify the beta keto group of the polyketide chain to which the 
two carbon unit has been added before the chain is transferred to the next module. Modules may 
contain additional enzymatic activities as well, such as methyl transferase domains, but there are 
no such additional activities in DEBS. 

Once a polyketide chain traverses the final extender module of a modular PKS, it
encounters the releasing domain or thioesterase found at the carboxyl end of most PKSs. Here,
the polyketide is cleaved from the enzyme and cyclized. The resulting polyketide can be
modified further by tailoring or modification enzymes; these enzymes add carbohydrate groups
or methyl groups, or make other modifications, i.e., oxidation or reduction, on the polyketide
core molecule. For example, the final steps in conversion of 6-dEB to erythromycin A include
the actions of a number of modification enzymes, such as C-6 hydroxylation, attachment of
mycarose and desosamine sugars, C-12 hydroxylation (which produces erythromycin C), and
conversion of mycarose to cladinose via O-methylation. These modifications in various
combinations result in erythromycins A (compound 2 in Figure 1), B, C, and D.

While the detailed understanding of the mechanisms by which PKS enzymes function
and the development of methods for manipulating PKS genes have facilitated the creation of
novel polyketides, there remain substantial impediments to the creation of novel polyketides by
genetic engineering. One such impediment is the availability of PKS genes. Many polyketides
are known but only a relatively small portion of the corresponding PKS genes have been cloned
and are available for manipulation. Moreover, in many instances the producing organism for an
interesting polyketide is obtainable only with great difficulty and expense, and techniques for its
growth in the laboratory and production of the polyketide it produces are unknown or difficult or
time-consuming to practice. Also, even if the PKS genes for a desired polyketide have been
cloned, those genes may not serve to drive the level of production desired in a particular host
cell.

If there were a method to produce a desired polyketide without having to access the
genes that encode the PKS that produces the polyketide, then many of these difficulties could be
ameliorated or avoided altogether. The present invention meets this need.

SUMMARY OF THE INVENTION 

In one embodiment, the present invention provides methods for the computational
analysis of polyketides and the computer-assisted design of PKS genes.

In a first aspect, the present invention provides a method for representing the structure of
a polyketide and/or a PKS gene that encodes the PKS that produces the polyketide by
alphanumeric symbols that facilitates computer assisted analysis.

In a second aspect, the present invention provides a database of polyketides and 
corresponding PKS genes that can be rapidly searched and information extracted for a variety of 
applications. More particularly, this database can include, in one mode, all known polyketides; 
and in another mode, the polyketides, optionally including all intermediates, produced by all 
known PKS genes or a subset thereof.

In a third aspect, the present invention provides a method for predicting the structure of a
PKS and its corresponding genes from the structure of a polyketide.

In a fourth aspect, the present invention provides a method for designing PKS
genes capable of producing a desired polyketide. This aspect of the invention is directed to the
design and specification of PKS genes via the recombining of modules or portions of modules or
sets of modules from already known and available PKS genes. In one mode, all possible PKS
genes encoding a desired polyketide from a set of genes in a database are generated. In another
mode, only a subset of such possible PKS genes is generated based on one or more parameters
selected by the user. More particularly, a rating system is provided to sort the PKS genes
designed for a particular target polyketide based on any one or more of several criteria, including
number of non-native module interfaces, number of non-native protein interfaces, and other
parameters as more particularly described below or selected by the user.

In another embodiment, the present invention provides methods and reagents for 
preparing novel PKS genes that encode PKS enzymes that produce a desired polyketide. 

In a first aspect, the present invention provides a library of recombinant DNA
compounds, wherein each member of said library encodes a module of a PKS or portions of
modules or sets of modules having a desired specificity, and the library as a whole encompasses
all of the members of a desired class of specificities.

In a second aspect, the present invention provides a method for assembling a PKS gene
cluster that encodes a PKS that produces a desired polyketide from known and available PKS
genes other than the naturally occurring PKS genes that produce the polyketide in nature.

These and other embodiments, modes, and aspects of the invention are described in more
detail in the following description, the examples, and claims set forth below.

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 shows a schematic representation of the PKS enzyme that synthesizes
6-deoxyerythronolide B (6-dEB, compound 1). The PKS is composed of three proteins, DEBS1,
DEBS2, and DEBS3, each of which is represented by an arrow and contains two or more
modules. Each module is represented by a solid line, and the domains in each module are shown
inside the arrow. Various intermediates produced during the synthesis are shown, together with
the structures of erythromycins A (compound 2), B, and D resulting from modification of 6-dEB.

Figure 2 shows an illustrative set of 2-carbon unit monomers present in macrocyclic
polyketides; these monomers can be used to represent polyketide backbone diversity generated
by commonly used starter and extender units (malonyl CoA and methylmalonyl CoA) and the
condensation and reduction reactions mediated by PKS enzymes.

Figure 3 shows a representation of 6-dEB by molecular graph, CHUCKLES notation,
and SMILES notation. The CHUCKLES notation uses the 2-carbon unit monomers shown in
Figure 2. In the CHUCKLES notation, the order of attachment of monomers is designated by
the order in which monomers are listed, and the attachment points within the monomers are
specified in their definitions. In the SMILES notation, adjacent monomers are attached via
single (covalent) bonds depicted by dashes. The cyclization bond is represented by the index 1
adjacent to the Start and Close monomers.

Figure 4 is a flowchart and block flow diagram in five parts designated A-E, inclusive. 

Flowchart Figure 4A is a block flow diagram of a computer system to design a novel
PKS (and corresponding genes).

Flowchart Figure 4B is a block flow diagram wherein the "Computer Program" block (2)
of Flowchart Figure 4A is further defined.

Flowchart Figure 4C is a block flow diagram wherein the "Design novel hybrid PKS
genes from library for TARGET" block of Flowchart Figure 4B is further defined.

Flowchart Figure 4D is a block flow diagram wherein the "align TARGET with
STARTER; copy to ALIGNMENT" block of Flowchart Figure 4C is further defined.

Flowchart Figure 4E is a block flow diagram wherein the "Rate novel hybrid designs"
block (3) of Flowchart Figure 4B is further defined.

Figure 5 shows a flowchart of a matching method for the generation of the CHUCKLES
strings used for all polyketides in a library.

DETAILED DESCRIPTION OF THE INVENTION

Because polyketides synthesized by modular PKS genes are built by the enzymatically
controlled addition of primarily 2-carbon unit monomers and, to a lesser extent, other more
complex monomers, each polyketide may be represented as a string of 2-carbon unit and other
monomers. These monomers represent the portion of the polyketide backbone structure as a
result of the incorporation of various starter and extender units (malonate, methyl malonate, etc.)
and the subsequent chemical reactions.

These reactions include:

(1) condensation reactions, of which there are three basic reactions: malonyl-CoA
condensation and methylmalonyl-CoA condensation with the branched methyl having either R
or S stereochemistry; and

(2) reduction reactions, of which there are five basic reactions: no reduction (ketone
preserved), keto-reduction (to yield a hydroxyl having either R or S stereochemistry),
dehydration (trans double bond), and enoyl-reduction (to yield a methylene).
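
One way to see how a small monomer alphabet can arise from these reactions is to enumerate
the combinations. The Python sketch below is a counting exercise under stated assumptions, not
the definition of the monomer set: the labels are invented for this example, and the collapse of
the methyl R/S label on dehydration (the methyl-bearing carbon becomes part of the double
bond) is an assumption made here. The actual monomers and their symbols are those of Figure 2.

```python
from itertools import product

# Three condensation outcomes and five reduction outcomes, as listed in the text.
CONDENSATIONS = ["malonyl", "methylmalonyl-R", "methylmalonyl-S"]
REDUCTIONS = ["ketone", "hydroxyl-R", "hydroxyl-S", "trans-alkene", "methylene"]


def backbone_unit(condensation, reduction):
    """Describe one two-carbon unit as a (methyl label, beta-carbon group) pair."""
    methyl = {
        "malonyl": None,
        "methylmalonyl-R": "R",
        "methylmalonyl-S": "S",
    }[condensation]
    # Assumption: with a double bond, the methyl-bearing carbon is no longer a
    # stereocentre, so the R and S variants describe the same unit.
    if reduction == "trans-alkene" and methyl is not None:
        methyl = "achiral"
    return (methyl, reduction)


distinct_units = sorted(
    {backbone_unit(c, r) for c, r in product(CONDENSATIONS, REDUCTIONS)},
    key=str,
)

print(len(distinct_units), "distinct units")
for unit in distinct_units:
    print(unit)
```

Under these assumptions the 15 raw combinations reduce to 14 distinct units, which is the same
as the number of symbols in the range A-N; whether each unit corresponds to one of those
symbols is not established by this sketch.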

An illustrative set of the basic monomers that can be used to represent a polyketide
structure (and their corresponding symbols) comprises:

[Monomer structures A-N not reproduced; see Figure 2.]

A miscellaneous monomer, Q, can be used to denote a portion of the polyketide structure that
cannot be assigned by monomers A-N.

The monomer set shown above and in Figure 2 does not represent the actual monomers
incorporated during biosynthesis. Instead, these monomers include a carbon from two different
biosynthetic monomers. This is best explained using a polyketide fragment depicted below.




[Structure not reproduced: polyketide fragment bearing two CH3 substituents.]



The fragment includes two two-carbon units, i and i+1, and part of a third two-carbon unit, i-1,
that were incorporated into the polyketide during biosynthesis. The i-th extender module
attaches the two carbon biosynthetic unit whose backbone carbons are designated as alpha(i) and
beta(i) and the second extender module attaches the two carbon biosynthetic unit whose backbone
carbons are designated as alpha(i+1) and beta(i+1). Using the monomer set shown above, this
fragment consists of monomer A (derived from the beta carbon added in module i+1 and the
alpha carbon added in module i) and another monomer A (derived from the beta carbon added in
module i and the alpha carbon added in module i+1).




[Diagram not reproduced: the fragment backbone divided into two monomers, each labeled A.]

The fifth carbon, designated beta(i+2), remains unassigned and will depend on the identity of the
two-carbon biosynthetic unit that is incorporated in the polyketide by module i+2.

The set of monomers shown in Figure 2 can be expanded to include other starter and
extender units, of which there are many. Such starter and extender units include, for example
but without limitation, hydroxymalonate (e.g., niddamycin), methoxymalonate (e.g., FK-520),
ethylmalonate (e.g., FK-520), amino acids or amino acid derivatives that are incorporated into
polyketides by the action of a non-ribosomal peptide synthase (e.g., thiazole in epothilone and
pipecolate in rapamycin), or other units incorporated by, for example, an AMP ligase (e.g., the
dihydroxycyclohexyl moiety in rapamycin, FK-506, and FK-520) or a soluble CoA ligase. An
illustrative set of additional starter and extender units includes:




[Structures not reproduced: additional units designated J', K', and M', bearing a substituent R.]

where R can be anything other than hydrogen or methyl (e.g., allyl, butyl, ethyl, hexyl, hydroxyl,
isobutyl, and methoxy).

The set of monomers can also include post-PKS modifications, such as hydroxylation,
methylation, epoxidation, glycosylation, or addition of intra-macrocyclic fused rings making the
system polycyclic. Also, a variety of methods are known for the incorporation of unusual starter
and/or extender units in polyketide synthases (see, e.g., PCT Publication Nos. WO 97/02358;
WO 99/03986; WO 98/01546; and WO 98/01571, each of which is incorporated herein by
reference), and the monomer set can include such units.

By viewing polyketides as composed of sets of distinct monomers, one can in
accordance with the present invention define a polyketide as a string of alpha-numeric symbols
to facilitate computer analysis. In one method, a modified CHUCKLES methodology for
representing polyketides is used. The CHUCKLES methodology (see Siani et al.,
"CHUCKLES: a method for representing and searching peptide and peptoid sequence," J.
Chem. Inf. Sci. 34: 588-593 (1994), which is incorporated herein by reference) for representing
peptides and related oligomers allows monomers to be strung together such that the molecular
graph for the basic macrocycle can be generated from the string of monomers.

For example, using the set of monomers comprising A-N described above, the
erythromycin macrocycle or 6-dEB can be represented as ADGJDD. This string of
alphanumeric symbols is also referred to as the CHUCKLES string. Figure 3 depicts the
relationship between the CHUCKLES string, the SMILES string, and the actual molecular
structure of 6-dEB. The CHUCKLES string for 6-dEB can be annotated to represent the
structure of erythromycin A: A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)
D(1-glycosyl). Thus, ring closure (cyclization), post-synthetic modifications (glycosylation and
hydroxylation), and non-standard units where applicable (there are none in 6-dEB and
erythromycin) are entered between parentheses after each monomer. Another example is an
annotated CHUCKLES string for epothilone B:
ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E. As above, cyclization, post-synthetic
modifications (epoxide formation), and non-standard units (methyl at C-4) are entered between
parentheses after each monomer.
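
The annotated strings above have a regular shape: a monomer symbol followed by an optional
parenthesized list of annotations. The following Python sketch shows one way such a string
could be split into monomers and annotations; the function names and the exact grammar
(single capital letter, optional prime, comma-separated annotations) are assumptions made for
illustration and are not taken from the CHUCKLES reference.

```python
import re

# One monomer symbol (optionally primed, e.g. M'), followed by an optional
# parenthesized, comma-separated annotation list. Leading spaces are skipped.
_TOKEN = re.compile(r"\s*([A-Z]'?)(?:\(([^()]*)\))?")


def parse_annotated_chuckles(text):
    """Split an annotated string into (symbol, [annotations]) pairs."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}: {text[pos]!r}")
        symbol, notes = match.group(1), match.group(2)
        annotations = [n.strip() for n in notes.split(",")] if notes else []
        tokens.append((symbol, annotations))
        pos = match.end()
    return tokens


def strip_annotations(text):
    """Return the unannotated monomer string."""
    return "".join(symbol for symbol, _ in parse_annotated_chuckles(text))


ERYTHROMYCIN_A = (
    "A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl) D(1-glycosyl)"
)
EPOTHILONE_B = "ME(1-lactone-closure)M(epoxide)LJDG(2-methylation)E"

print(strip_annotations(ERYTHROMYCIN_A))
print(strip_annotations(EPOTHILONE_B))
for symbol, annotations in parse_annotated_chuckles(ERYTHROMYCIN_A):
    print(symbol, annotations)
```

Stripping the annotations from the erythromycin A string recovers the 6-dEB string given in the
text.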

In another aspect of the present invention, a database of polyketides is provided. In one
aspect of the present invention, the polyketides are represented by a string of defined monomers.
In one embodiment, the monomers are selected from a group consisting of:

[Monomer structures not reproduced.]

In another embodiment, polyketides are represented by the monomers A-N as well as
additional monomers selected from the group consisting of



[Structures not reproduced: additional monomers, including those designated G', H', and M',
bearing hydroxyl (OH) and R substituents.]



where R can be anything other than hydrogen or methyl. 

The string of monomers can be represented as a linearized structure or as a string of 
symbols. For example, the erythromycin can be represented as its aglycone, 6-dEB, as 




or as a string of symbols, ADGJDD. Optionally, the string of symbols can be annotated as
"A(1-lactone closure,2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl) D(1-glycosyl)" to more fully capture
the erythromycin structure. This set of annotated strings is referred to as a "coded library" or a
"coded" database of the present invention.



In an illustrative embodiment, the polyketide database consists of the polyketides
described in current literature (Journal of Antibiotics (1981-present), Journal of Natural
Products) and various databases (Chemical Abstracts CAPlus, AntiBase). All unique
macrocyclic polyketides are converted to the modified CHUCKLES format. Of the ~1000 novel
polyketides obtained, only ~200 different strings of monomers and unique macrocycles are
needed to represent the much larger collection of polyketides in the database, because many of
the differences between the naturally-occurring polyketides are due to different glycosyl (sugar)
groups attached at different positions on the macrocycle.

Thus, a macrocyclic polyketide can be converted to a string of 2-carbon monomers by
mapping the monomers onto the polyketide. This can be performed manually or with computer
assistance. First, any sugar moieties are conceptually removed by hydrolysis and any lactones
(bond between the ketone and oxygen) are hydrolyzed, thus generating a linearized structure of
the backbone of the polyketide. Generally, this leaves a carboxy carbon at one end of the linear
molecule and a hydroxyl at the other. The polyketide is then "sequenced" manually or in silico
from the end containing the carboxy carbon, the end corresponding to the last monomer added
by the PKS before synthesis is complete. This end serves as a convenient handle from which to
start the matching process. Although closing of the lactone often occurs between the two ends of
the polyketide, this is not always the case. However, the last ketone added by the PKS is almost
always involved in macrolactone formation and so serves as a more convenient handle than the
hydroxyl for commencing sequencing.

The manual or in silico sequencing is performed by matching the monomers, one at a
time, while traversing the macrocyclic backbone. First the carboxy carbon is skipped, and an
attempt is made to match each of the monomers in the monomer set selected (i.e., monomer set 
A-N in Figure 2) against the next two carbons in the macrocycle. The match takes into account 
carbon, oxygen, and no substitution at each backbone position, chirality at each backbone 
position, and bond order between the two backbone carbons. 

If the sequencing is performed in silico, the method is referred to as back-translation and
involves converting a molecular graph into a string of monomers. First, the monomer library is
converted to SMARTS format. SMARTS is a superset of the SMILES language that specifies a
pattern in a molecular graph (Daylight Software Manual: Theory; Daylight Chemical
Information Systems; Irvine, CA 1993, incorporated herein by reference). SMARTS permits
one to specify a variable number or a limit on the number of covalent bonds to non-hydrogen
atoms from a particular atom. In contrast, SMILES assumes that the unspecified valences are
hydrogens. For example, the SMILES string for monomer A is [C@@H](O)[C@H](C). The
oxygen may be bonded to any other single atom; if the atom is not specified, it is assumed to be a
hydrogen. In the SMARTS string for monomer A, [C@@H]([O;D2])[C@H]([CH3]), one can
specify the exact number of hydrogens on some atoms (e.g., "CH3"). In addition, the "[O;D2]"
indicates the oxygen is bonded to two (from D2) non-hydrogen atoms, in this case the first
carbon and some other unspecified atom. This allows matching and distinction of
post-modification moieties attached to the oxygen as well as additional cyclizations (six member
rings can occur within the macrocycle; e.g., rapamycin). Thus, the SMARTS notation allows
pattern matching against the polyketide molecular graph.

When a match occurs, the atoms that match are tagged as part of a superset and labeled
with the monomer name. Any atoms that are connected to the monomer that are not part of the
macrocycle are tagged for identification as special precursor units (e.g., ethylmalonate instead of
methyl malonate or malonate), or post-synthetic modification moieties (e.g., sugars, CCHO,
hydroxylation, methylation). If all the atoms and bonds of the monomer cannot be identified,
the monomer is given a designation to indicate the lack of identification (e.g., Q for question
mark). These Q monomers can be used to identify monomers that are the site of post-PKS
modifications that mask the function of the PKS module that generated that portion of the
polyketide or that are not in the monomer set and so prevent the correlation of a particular
segment of the backbone with one of the monomers in the monomer set.

After a particular 2-carbon unit is identified, the next two carbons are processed the same
way. This is repeated until all the backbone carbons are identified and labeled as monomers.
When all two-carbon units are identified, one has generated an ordered sequence, or string, of
monomers, which is a modified CHUCKLES string of the invention. Moieties corresponding to
post-PKS modifications are appended to the monomer in the string as an annotation in
parentheses. This method of sequencing may be extended to include any type of monomer.
Figure 5 shows a flow chart of this matching method for the generation of the CHUCKLES
strings used for all polyketides in a library.
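
The two-carbons-at-a-time matching loop can be illustrated without a cheminformatics toolkit
by describing each backbone carbon with the properties the match is said to consider
(substituent, chirality) plus the bond order between the pair. The Python sketch below is a
simplified stand-in for SMARTS matching on a molecular graph: the data layout, the pattern
table, and the lowercase symbols "a", "b", and "c" are all invented for this example and do not
correspond to the monomers of Figure 2. Only the use of Q for an unmatched unit follows the text.

```python
from typing import NamedTuple


class Carbon(NamedTuple):
    substituent: str  # "O" (oxygen), "C" (carbon branch), or "" (none)
    chirality: str    # "R", "S", or "" when the carbon is not a stereocentre


class Pattern(NamedTuple):
    symbol: str
    first: Carbon     # carbons are listed in traversal order
    second: Carbon
    bond_order: int   # bond order between the two backbone carbons


# Hypothetical pattern table, for illustration only.
TOY_PATTERNS = [
    Pattern("a", Carbon("C", "R"), Carbon("O", "S"), 1),
    Pattern("b", Carbon("C", "R"), Carbon("O", ""), 1),
    Pattern("c", Carbon("", ""), Carbon("", ""), 2),
]


def back_translate(backbone, bond_orders, patterns=TOY_PATTERNS):
    """Match the backbone two carbons at a time and return the monomer string.

    `backbone` lists the carbons after the skipped carboxy carbon, and
    `bond_orders[k]` is the bond order within the k-th two-carbon unit.
    """
    symbols = []
    for i in range(0, len(backbone) - 1, 2):
        unit = (backbone[i], backbone[i + 1], bond_orders[i // 2])
        for pattern in patterns:
            if (pattern.first, pattern.second, pattern.bond_order) == unit:
                symbols.append(pattern.symbol)
                break
        else:
            symbols.append("Q")  # unit not found in the monomer set
    return "".join(symbols)


example_backbone = [
    Carbon("C", "R"), Carbon("O", "S"),
    Carbon("C", "R"), Carbon("O", ""),
    Carbon("", ""), Carbon("", ""),
    Carbon("C", "S"), Carbon("O", "S"),   # not in the toy table
]
example_bond_orders = [1, 1, 2, 1]

print(back_translate(example_backbone, example_bond_orders))
```

A fuller implementation would also collect the off-backbone atoms attached to each matched
unit so that they can be appended as parenthesized annotations.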

The CHUCKLES string can be in the order corresponding to the direction of
biosynthesis on the PKS or its reverse. Each CHUCKLES string has a one-to-many relationship
with the PKS gene in the producing organism. Thus, while many different organisms can
produce the same polyketide using the same or different PKS genes, each PKS gene generally
produces only one PKS that produces only one polyketide (some AT domains can bind different
CoAs, leading to the production of multiple polyketides from a single PKS). This allows one to
design, from the polyketide structure, a set of PKS genes that would produce that polyketide.

Thus, the present invention provides methods and computational analysis tools for
designing PKS genes to produce a desired polyketide. As an illustrative example, the present
invention provides a computer program termed MORPH (see the Examples below) that can read
the coded library (see the Examples below). An illustrative coded library consists of ~200
unique polyketide CHUCKLES strings. The user specifies the target polyketide, which is
converted from molecular structure to a CHUCKLES string.

The program then performs the following, starting with each library compound or string:

(1) aligns library compound and target compound, emphasizing alignment of adjacent
monomers common between the two;

(2) fills in the gaps using all possible combinations from all library members;

(3) counts number of non-natural inter-modular boundaries;

(4) outputs all these alignments.

The alignments are then sorted based on the number of non-natural inter-modular
boundaries.
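
The counting and sorting steps can be illustrated with a deliberately simplified Python sketch.
It is not the MORPH program: instead of aligning a STARTER and filling gaps, it enumerates
every way of sourcing each target monomer from a library string, which is only practical for
tiny inputs. The function names and the second library entry are hypothetical; "ADGJDD" is the
6-dEB string given earlier.

```python
from itertools import product

# Toy coded library: name -> unannotated monomer string.
TOY_LIBRARY = {
    "6-dEB": "ADGJDD",        # string given in the text
    "hypothetical": "ADDGJ",  # invented entry for illustration
}


def candidate_sources(target, library):
    """For each target monomer, list every (name, index) in the library carrying it."""
    return [
        [
            (name, i)
            for name, string in library.items()
            for i, monomer in enumerate(string)
            if monomer == target_monomer
        ]
        for target_monomer in target
    ]


def nonnative_boundaries(design):
    """Count adjacent positions that are not adjacent modules of the same source."""
    return sum(
        1
        for (name1, i1), (name2, i2) in zip(design, design[1:])
        if not (name1 == name2 and i2 == i1 + 1)
    )


def rank_designs(target, library):
    """Return (boundary count, design) pairs sorted by fewest non-native boundaries."""
    designs = product(*candidate_sources(target, library))
    scored = [(nonnative_boundaries(design), design) for design in designs]
    return sorted(scored, key=lambda pair: pair[0])


ranked = rank_designs("DGJDG", TOY_LIBRARY)
best_score, best_design = ranked[0]
print(len(ranked), "designs; fewest non-native boundaries:", best_score)
print(best_design)
```

If a target monomer is absent from every library string, no design exists and the ranking is
empty; a practical program would also need pruning, since the number of combinations grows
multiplicatively with target length.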

This illustrative program allows one to design and build PKS genes that encode PKS
enzymes that are combinations of two or more different PKS enzymes with the fewest
inter-modular boundaries, and optionally the fewest inter-protein boundaries. Many other alternative
embodiments are provided by the present invention.

For example, one can include the naturally occurring PKS that produces the target
polyketide in the coded library to allow components of that PKS to be incorporated into the
design of a new PKS. Also, one can include in the coded library non-naturally occurring PKS
enzymes, such as those produced and published in the scientific and patent literature to make
novel polyketides. See, e.g., PCT publication Nos. WO 98/49315 and WO
96/40968, both of which are incorporated herein by reference.

This CHUCKLES-coded polyketide library can be stored in a computer file as a set of
records. In one embodiment, each record contains the chemical name of the polyketide, the
unannotated CHUCKLES (containing basic macrocyclic monomers), the annotated
CHUCKLES (containing basic macrocyclic monomers with information about post-PKS
modifications), the producing organism(s), and other information (e.g., linearized representation
of the polyketide structure, the accession number of organisms or plasmids that have been
deposited, gene sequence information, and references).
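
As a rough illustration of such a record, the following Python sketch models one library entry and parses it from a single delimited line. The field names, the pipe-delimited layout, and the treatment of a leading "#" as a commented-out entry are assumptions made for this sketch only; the actual file format read by MORPH is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolyketideRecord:
    """One library record (hypothetical field names)."""
    name: str
    chuckles: str                      # unannotated CHUCKLES string
    annotated_chuckles: str = ""       # CHUCKLES with post-PKS modifications
    organisms: List[str] = field(default_factory=list)
    commented_out: bool = False        # assumed meaning of a leading '#'


def parse_record(line: str) -> PolyketideRecord:
    """Parse 'name | CHUCKLES | annotated | organisms' (assumed layout)."""
    cells = [c.strip() for c in line.split("|")]
    cells += [""] * (4 - len(cells))   # pad missing trailing cells
    name, chuckles, annotated, source = cells[:4]
    organisms = [o.strip() for o in source.split(",") if o.strip()]
    return PolyketideRecord(
        name=name.lstrip("#"),
        chuckles=chuckles,
        annotated_chuckles=annotated,
        organisms=organisms,
        commented_out=name.startswith("#"),
    )


record = parse_record("A83543A | FLFQQQ | FLFQQQ | Saccharopolyspora spinosa")
```

Keeping the unannotated and annotated strings as separate fields mirrors the description above, so an alignment routine can work on the plain monomer string while the annotations remain available for later reporting.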

The MORPH program can read in the polyketide library entries to an array or list of data
structures, where each entry data structure contains all or a selected subset of the fields in each
library record. The MORPH program then reads in the CHUCKLES-encoded TARGET
polyketide from the user. This TARGET may optionally be blocked from the library so that it is
not used as a STARTER or left in the library, i.e., if it is only distantly related to other known
polyketides, or some modules could be useful in designing novel PKS genes, or it is desired to
replace only certain PKS modules. This program could also be used for analoging at a particular
position via wild-cards defined as part of the TARGET sequence by the user.

Each member of the coded PKS library can be selected as a STARTER unit. Thus,
during a run, all library members can be given an equal chance as STARTER units. After a
STARTER is chosen, the TARGET is aligned with it. See Flowchart Figure 4D. Any method
of alignment can be used such that the maximal number of adjacent STARTER modules is used
in the final alignment. After the maximal adjacent modules are used in the ALIGNMENT,
smaller adjacent sets or individual modules from the STARTER are used to fill in the gaps.
There may be several alignments that are equally good based on the attempt to optimize the
number of adjacent modules. For example, if the TARGET contains the "JDG" substring, then
6-dEB, identified with the A1D2G3J4D5D6 CHUCKLES string, may align as



TARGET   JDG
6-dEB    J4D2G3

TARGET   JDG
6-dEB    J4D5G3.



Both of these alignments have different maximal adjacent modules, with the
same length of two (D2G3 in the first and J4D5 in the second). Accordingly, either
alignment could be used as STARTERs.
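
The search for maximal adjacent modules can be sketched in Python as a longest-common-substring scan. This is only one possible alignment method, chosen here for illustration; the function names and the one-character-per-monomer assumption are not taken from the MORPH source code.

```python
from typing import List, Tuple


def longest_adjacent_runs(target: str, starter: str) -> List[Tuple[int, int, int]]:
    """Return (target_start, starter_start, length) for every longest run of
    adjacent monomers shared by target and starter (0-based positions)."""
    best = 0
    runs: List[Tuple[int, int, int]] = []
    # prev[j] = length of the shared run ending at target[i-2], starter[j-1]
    prev = [0] * (len(starter) + 1)
    for i in range(1, len(target) + 1):
        cur = [0] * (len(starter) + 1)
        for j in range(1, len(starter) + 1):
            if target[i - 1] == starter[j - 1]:
                cur[j] = prev[j - 1] + 1
                n = cur[j]
                if n > best:
                    best, runs = n, [(i - n, j - n, n)]
                elif n == best:
                    runs.append((i - n, j - n, n))
        prev = cur
    return runs


def label(starter: str, start: int, length: int) -> str:
    """Write a run with 1-based module numbers, e.g. 'D2G3'."""
    return "".join(f"{starter[k]}{k + 1}" for k in range(start, start + length))


runs = longest_adjacent_runs("JDG", "ADGJDD")
labels = sorted(label("ADGJDD", s, n) for _, s, n in runs)
print(labels)  # prints ['D2G3', 'J4D5']
```

For the "JDG" target and the 6-dEB string ADGJDD, the scan reports the two equally long runs named in the text, so either could seed the alignment.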

With the optimized alignment from the STARTER, other library entries are
systematically used to complete the alignment, or fill in the gaps. This part may be performed
on either the optimized ALIGNMENT described above, or the ALIGNMENT without the single
modules from the STARTER; the removal of the individual modules opens up more space into
which larger pieces of the FILLER might be placed. The first library entry is designated as the
FILLER. If the FILLER is the same as the STARTER, then the next library entry is designated as the
FILLER. This library entry is flagged as the CURRENT_FILLER_LIBRARY_ENTRY. The
same method for finding maximally adjacent modules and then smaller sets or single modules is
used to fill the gaps in ALIGNMENT from the FILLER. If not all the gaps are filled in the
ALIGNMENT, then the next library entry is used as a new source; that is, it is designated as the
FILLER, and the gaps are filled further. This is repeated until the ALIGNMENT is complete or
the end of the library is reached.

Assuming all modules in the TARGET are represented in the library, the ALIGNMENT
is eventually completely filled. The completed alignment is then written to an output file on the
computer disk. When the ALIGNMENT is complete, or there are no more FILLERs in the
library, the TARGET and STARTER alignment are re-copied to ALIGNMENT. The
CURRENT_FILLER_LIBRARY_ENTRY is incremented, and a new attempt to fill in the gaps
is started.

When the CURRENT_FILLER_LIBRARY_ENTRY has reached the end of the library,
the ALIGNMENT is wiped, and a new STARTER is chosen. The above process is then
repeated for the next STARTER. When all library entries have been used as starters, then all
feasible novel polyketide synthases have been generated and written to the computer file. The
novel PKSs are then read back into memory and can be further evaluated. An illustrative
evaluation process involves:

(1) counting the non-native inter-module interfaces, and

(2) counting the number of native inter-protein interfaces (for known and
annotated gene sequences).

The novel PKSs are then sorted based on these two numbers, giving higher priority to the
non-native inter-module interfaces. In this mode, the goal is to identify those novel PKSs that contain
the fewest non-native interfaces.
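
A minimal Python sketch of the first evaluation step follows. It assumes, purely for illustration, that each module of a candidate PKS is stored as a (source polyketide, module position) pair, and that an interface is native only when two neighboring modules come from consecutive positions of the same source. The inter-protein count is omitted because it depends on gene annotation that is not described here.

```python
from typing import List, Tuple

# (source polyketide name, 0-based module position in that source): assumed layout
Module = Tuple[str, int]


def count_non_native_interfaces(design: List[Module]) -> int:
    """Count neighboring module pairs that are not adjacent in one source PKS."""
    count = 0
    for (src_a, pos_a), (src_b, pos_b) in zip(design, design[1:]):
        native = src_a == src_b and pos_b == pos_a + 1
        if not native:
            count += 1
    return count


# Two hypothetical candidate designs for the same four-module target
design_a = [("aldgamycin", 1), ("aldgamycin", 2), ("tedanolide", 4), ("tedanolide", 5)]
design_b = [("aldgamycin", 1), ("tedanolide", 4), ("aldgamycin", 5), ("tedanolide", 5)]

# Fewest non-native interfaces first
ranked = sorted([design_b, design_a], key=count_non_native_interfaces)
```

Here design_a has a single non-native junction and design_b has three, so design_a sorts first.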

By providing methods and means for the computer-assisted analysis of polyketides and
PKS genes, the present invention greatly facilitates the identification and production of new
polyketides with useful activities. Those of skill in the art will appreciate that while the
invention is in part illustrated in the Examples below with respect to the design of new PKS
genes for known polyketides, the invention can also be used to design PKS genes for novel
polyketides. In this embodiment, one simply provides the structure of the novel polyketide to
the MORPH or other program of the invention to generate the desired PKS genes.

Moreover, while the invention is exemplified below by designing new PKS genes
composed of the coding sequences for one or more complete modules of two or more different
PKS genes, partial modules can also be employed. With the appropriate choice of monomer sets
and corresponding coding of the library to be searched, one can generate new PKS gene designs
that take advantage of the potential to fuse one PKS gene coding sequence to another at a site
corresponding to an intra-modular junction. In another embodiment, one can use "wild-cards"
in the encoded polyketide or library to take advantage of known or predicted SAR. Thus, if one
knows that a particular position in a polyketide can be varied (i.e., a hydrogen, methyl, or ethyl
group at a location determined by an AT domain of a particular module, or a hydroxyl or keto
group at a location determined by the presence or absence of a KR domain in a particular
module) then one can use a wild-card monomer designation in the polyketide CHUCKLES
string to generate PKS genes that produce each of the desired variants.
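
One way to picture the wild-card idea is to expand a target string into every concrete variant before alignment. The Python sketch below does this with a Cartesian product; the function name and the dictionary form of the wild-card sets are assumptions of this sketch, and MORPH itself may handle wild-cards differently (for example during matching rather than by expansion).

```python
from itertools import product
from typing import Dict, List


def expand_wildcards(target: str, wildcards: Dict[str, str]) -> List[str]:
    """Replace each wild-card letter by every monomer in its set.

    Letters not listed in `wildcards` are kept as they are. Monomer codes are
    case-sensitive, so 'g' and 'G' remain distinct monomers.
    """
    choices = [wildcards.get(ch, ch) for ch in target]
    return ["".join(combo) for combo in product(*choices)]


variants = expand_wildcards("MEXLJDGE", {"X": "ABCD"})
print(variants)  # prints ['MEALJDGE', 'MEBLJDGE', 'MECLJDGE', 'MEDLJDGE']
```

With three independent wild-card sets the number of variants is the product of the set sizes, which is worth keeping in mind before aligning each variant against the whole library.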

The methods of the invention have diverse application in addition to the design of new
PKS genes. As but one illustrative example, the methods of the invention can be used to design
methods to produce a desired compound. Organic molecules containing stereochemical centers
are useful for a number of purposes, including use as synthetic or semi-synthetic intermediates.
The preparation of such intermediates by organic synthesis can be extremely time consuming
and expensive. An alternative source of such intermediates is via specific degradation of a
polyketide, and the present invention provides computer-assisted means for designing such
production methods.

Thus, certain functional groups of polyketides are susceptible to bond cleavage by
specific chemical reactions that do not affect other functional groups. For example,
carbon-carbon double bonds can be specifically cleaved by permanganate without affecting other
functional groups normally in polyketides, such as ketones, alcohols, and lactones. Likewise,
the Baeyer-Villiger reaction converts a ketone to an ester (lactone) without affecting other
groups of the aglycone. In accordance with the methods of the invention, one can assemble a
library of polyketides in a database that can be addressed with a query describing a particular
chemical reaction to generate all of the degradation products produced by that reaction upon
each of the polyketides in the library. The degradation fragments thus generated serve as a
library of the invention that can be sorted by properties, such as size, number and type of
stereochemical centers, functional groups, or other factors, and searched for useful compounds.
Moreover, the functional groups on the ends of the fragments generated (or at other locations)
can also be converted to other functional groups by chemical reactions (optionally employing
protecting groups on other functional groups), and the database of compounds can be expanded
to include the compounds produced by such reactions.

From even a modest library of ~200 compounds, one can in this manner generate, using
the methods of the invention, two to three times as many valuable chemical intermediates. Once
such an intermediate is identified, the organism that produces the polyketide from which the
fragment is derived is fermented, the polyketide isolated in bulk, the chemical reaction
performed, and the desired degradation product(s) isolated and used. In this manner, the present
invention makes available a wide variety of useful products otherwise unattainable.

Thus, the present invention has wide application in the fields of chemistry, particularly
medicinal chemistry, molecular biology, and medicine. Those of skill in the art will recognize
these and other benefits and applications provided by the present invention. Thus, the following
examples are given for the purpose of illustrating the present invention and shall not be
construed as being a limitation on the scope of the invention or claims.

EXAMPLE 1

The MORPH Program 

This example provides the source code for an illustrative MORPH program of the
invention. The MORPH program is a command line driven program that runs on a UNIX
system. The program can be run from a shell script, such that the user fills in the entire
command ahead of time, then post-processes the output file with UNIX utilities including sort,
egrep, and uniq.

The command line appears as follows:

morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards].
The library file is the name of the text file described below in Example 2. The target
name is a user-defined identifier to distinguish this target from the library members (e.g.,
epothiloneD). The target sequence is a string of monomers that represent the CHUCKLES-encoded
target polyketide (e.g., MEMLJDGE). Generally, if the target sequence is in the
library, it is commented out from the library so that the morph program does not find the target
itself. The three different wildcards, X, Y, and Z, are independent sets of monomers that can be
included in the target sequence.

The output from the morph program can be redirected to a file. This output file is then
post-processed by (1) extracting the HIT lines with valid combinations of modules that yield the
target, (2) sorting the HITs based on alphanumeric content using the UNIX sort command, (3)
running the UNIX uniq command, which removes multiple copies of each HIT, leaving one copy
of each, and (4) sorting based on the number of pieces in the sequence of modules. Generally, the
fewest number of pieces, which correspond to the fewest number of inter-modular interfaces, are
desired; these will appear at the top of the output.

Below are some illustrative examples of calls to the MORPH program from a shell script
using epothilone as a target. The first example generates combinations that yield epothilone D:

%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD

%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort

The second example generates combinations that yield a derivative of epothilone D having a
hydroxyl at C-13:

%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort

The third example generates an epothilone having wildcards (set 1):

%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

The fourth example generates an epothilone having another set of wildcards (set 2):

%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort
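
For readers without the UNIX utilities at hand, the post-processing pipeline can be approximated in Python. The sketch below assumes that the obsolete `sort +10 -11` form selects the eleventh whitespace-separated field as the sort key and that this field holds the number of pieces; the layout of a HIT line is not documented here, so the sample lines in the usage example are invented for illustration.

```python
from typing import Iterable, List


def postprocess(lines: Iterable[str], key_field: int = 10) -> List[str]:
    """Approximate: egrep HIT | sort | uniq | sort +10 -11.

    key_field is the 0-based index of the whitespace-separated field used as
    the final sort key. The comparison is lexicographic, as with sort without
    the -n option, so '10' would sort before '5'.
    """
    # keep HIT lines, drop duplicates, and sort on the whole line
    hits = sorted({line.rstrip("\n") for line in lines if "HIT" in line})

    def key(line: str) -> str:
        fields = line.split()
        return fields[key_field] if len(fields) > key_field else ""

    # Python's sort is stable, so ties keep the whole-line order from above
    return sorted(hits, key=key)


sample = [
    "HIT a b c d e f g h i 7 x\n",   # invented HIT line with key '7'
    "progress message\n",
    "HIT a b c d e f g h i 5 y\n",   # invented HIT line with key '5'
    "HIT a b c d e f g h i 7 x\n",   # duplicate of the first line
]
result = postprocess(sample)
```

The result keeps one copy of each HIT line, with the line whose key field is smallest first.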

MORPH in its current implementation operates at the monomer level and thus does not
handle intra-modular modifications/splitting. Future implementations could convert the
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double
bonds are present in the library, but are ignored by the program. These bonds can be introduced
post-biosynthetically and the exact source is generally unknown.

The source code for MORPH is found in Appendix A (version 3.0) and B (version 4.0)
(deposited in the microfiche appendix).

EXAMPLE 2

Illustrative Polyketide Library 

This example provides the contents of an illustrative CHUCKLES-encoded polyketide
library. The first column provides the name of the polyketide; the second the CHUCKLES
string; the third the annotated CHUCKLES string; and the fourth the source organism. Entries
under annotated CHUCKLES and source organism are not complete for all of the polyketides.
POLYKETIDE | CHUCKLES | annotated-CHUCKLES | SOURCE ORGANISM
3-acetyl-4"-butyryltylosin | FMNGODF | |
#aculeximycin | RENRSMRSRSSSSRLSSN | RR(2^%1)NRSMRS(1-glycosyl)RSSS(2-hydroxyl)SR(1-glycosyl)LSSN(2-c%l) |
albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077 | BLME=JN | BLM(1,2-epoxy)E(1-methoxy)=J(2-hydroxyl)N |
albocycline-M2 | BLME=JN | B(2-hydroxyl)LME(1-methoxy)=J(2-hydroxyl)N |
albocycline-MS | BSME=JL | BSME(1-methoxy)=J(2-hydroxyl)L |
albocycline-MS | BSME=JN | BSME(1-methoxy)=J(2-hydroxyl)N |
albocycline-M6 | BLME=JL | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)L |
albocycline-M7 | BLME=JN | BL(2-hydroxyl)ME(1-methoxy)=J(2-hydroxyl)N |
albocycline-MS | BLMEK? | BLME(1-methoxy)=O |
aldgamycin | BMLGJDL | B(1-cyc)MLG(2-hydroxyl)JDL(2-cyc) |
amphotericinA | CDNNLNNNNFCEFEALEE | CDNNLNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE |
#amphotericinB | CDNNNNNIWFQQQEELEE | CDN>n^NNNOT(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EF(1-cyc)EALEE |
angiokm | NMFJNSJIQLGAm | |
aplyronineA | BFJCENFFMEKAFNN | B(1-C(=O)C)F(1-C(=O)C(C)N(C)C)JCENFFME(1-methoxy)KAF(1-C(=O)C(N(C)C)COC)NN | sea hare Aplysia kurodai
apoptolidin | EFLMNAMMM | EF(methoxy-1,hydroxy-2)LMNAMMM |
aurachinB | MLMLM | MLMLM |
aurachinC | MLMLM | MLMLM |
A59770 | QQKQQOFCDN | QQK(2-ethyl)Q=QLJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | Amycolatopsis orientalis
A82548A-cytovaricin | QQQKQNUEDDN | |
A83543A | FLFQQQ | FLFQQQ | Saccharopolyspora spinosa
AB023a | NNNNNRSSSLSR | NNNNNRSSSLSR |
AH.758 | RSNMURMN | RS(2-methoxy)NMURMN(2-methoxy) |
bafilomycinD | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) | S. sp.
bafilomycinA1 | BNHCENMKCMN | BNHCE(1-macrocyc,2-methoxy)NMKCMN(2-methoxy)Q(keto-macrocyc) |
borrelidin | SNMRUUUS | SNM(2-N)RUUUS |
calyculin | QFBBMnoNm | QF(lHaieai(»cy)BBMmNm | Discodermia calyx
candicidin-candeptin-ascosin-levorin-etc | ODNNNNNNNFCEEUEE | CDNNNNNNNF(1-glycosyl)C(1-O-cyc,2-carboxylicacid)EB(1-cyc)E(2-hydroxyl)UEB | S. griseus, S. canescus, S. levoris, S. viridoflavus, Stv. grisoviridum
candidin | QDNNNNNNNFCEFELIEE | la)MQNNNNNF(1-glycosyl)^^^cyc,2-carboxylicacid)EF(1-cyc)E(2-hydroxyl)LIEE |
carbomycin | ENNCX)DF | EN(1,2-epoxy)NHOF(1-glycosyl,2-methoxy)F(1-C(=O)C) |
carbomycinB-magnamycinB | FNNGOEP | FNNGOF(1-glycosyl,2-methoxy)F(1- |
carbomycin-A-magnamycin-deltamycinA4-NSC51001-PS97-3628-WC3628 | ENNHOFF | ENNHO(includes CCHO)FF |
chalcomycin-myconomycin-aldgamycinDmikonomycin | BNNGKDN | B(2-O-glycosyl)N(1,2-epoxy)NG(2-hydroxyl)KD(1-glycosyl)N | S. bikiniensis, S. albogriseolus
chimeramycinB-PTL448 | BMNCODF | B(O-ethyl)MNC(1-glycosyl)OD(1-glycosyl)F | S. ambofaciens ka-448
chivosazolA | SRRnNNSRQNn | Siai(1-O-macrocyc)iLNNSR(1-methoxy)QNnR(1-glycosyl)MnNnQ(keto-macrocyc) | S. cellulosum
cineromycinB | BLME=JN | BLME=J(2-hydroxyl)N | S. cinereochromogenes, S. sp.
cineromycinBdehyro | BLMI=JN | BLMI=J(2-hydroxyl)N | S. griseoviridis
cineromycinB2,3dihyro | BLME=JL | BLME=J(2-hydroxyl)L | S. griseoviridis
cirramycinB1dihydroxy-A688SX | BMNGODF | BM(1,2-epoxy)NGOD(1-glycosyl)F | S. flocculus
cirramycinB-cirramycinB1-Acumycin-A688A-B58941-A6888A | AMNGODF | AM(1,2-epoxy)NGOD(1-glycosyl)F | S. cirratius, S. griseoflavus, S. fradiae, S. flocculus
cladospolideA | ELLFN | ELLF(2-hydroxyl)N | Cladosporium fulvum, C. cladosporiodes
cladospolideB | ELLFn | ELLF(2-hydroxyl)n | Cladosporium fulvum, C. cladosporiodes
cladospolideC | ELLEN | ELLE(2-hydroxyl)N | fungus Cladosporium tenuissimum
concanamycinA-lolunycm-Aool -1 - OilCA T* AXT1 '3T3T> X4357B | CENMKCCMN | CE(2-methoxy)NMKC(2-ethyl)CMN(2-methoxy) | S. diastatochromogenes, S. sp., S. neyagawaensis
concanamycinB-S45B | CdNMKCCMN | CE(2-methoxy)NMKCCMN(2-methoxy) | S. diastatochromogenes
concanamycinC-anhydroconcanamycinB | NBNHCENMKCCMN | NBNHCE(2-methoxy)NMKCCMN(2-methoxy) |
conglobatin | AJM | A(1-oxazoyl)JMjUM |
copiamycin | BMtSSSSSSRLRSRN | RNRSSSS(1-O-cyc)S(2-hydroxyl)S(1-cyc)RLRSRN | S. hygroscopicus
cytovaricin-H230 | QQQKQNLJFCDN | QQQKQMJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. sp., S. collinus
cytovaricinB | QQKQMJFDDN | QQQKQMJ(2-hydroxyl)F(2-O-glycosyl)CD(2-hydroxyl)N | S. torulosus
CP64537 | AROKDD | A(2-glycosyl)R(1-C(=O)C(0XXQQG(2-hydroxyl)KDD(1-glycosyl) | Streptomyces toyocaensis humicola ATCC 39491
damavaricinC | ODOCCNM | QDQGCNM | S. spectabilis
deltamycinA1 | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae
deltamycinX-desisovalerylcarbomycinA | ENNGOFF | EN(1,2-epoxy)NGOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. deltae, S. halstedii-deltae
engleromycin | QNJHN | QNJH(2-hydroxyl)N(1,2-epoxy) | Engleromyces goetzei
#epothilone | MEMLJDgE | MEM(1,2-epoxy)LJDG(2-methyl)E |
erythromycin | ADGJDD | A(2-hydroxyl)DGJ(2-hydroxyl)D(1-glycosyl)D(1-glycosyl) |
espinomycinA2 | ENNCOFF | ENNCOF(1-glycosyl,2-methoxy)F(1-C(=O)C) | S. fungicidicus
filipinin-lagosin14deoxy | JiiNMJNNMHFVjfi*'EFFL | E(2-hydroxyl)NNN>JMEFl'mFF | S. filipinensis, S. durhamensis
filipin-lagosin | ENNNNMEFFFFFFF | | S. filipinensis, S. durhamensis
formamicin | CBNMOCMN | CBNMO(includes a long, branched alkyl chain)CMN |
foromacidinB-spiramycinB | ENNAOFF | | S. ambofaciens
FD891 | RRUSNNLNRMM | RRUS(1-O-macrocyc)NNL(2-hydroxyl)N(1,2-epoxy)RMMQ(keto-macrocyc) | S. graminofaciens
FK895 | BNRNMRNRLS | R(1-methoxy)N(1,2-epoxy)KNMR(1-O-macrocyc)NR(1-C(O)C,2-hydroxyl)LSQ(keto-macrocyc) | S. hygroscopicus
FK-506 | MAEPMJ3BK0Q | |
gedamycin | JEJBNNNnNNNFAEFEEEEIE | IEJBNNNnNNNF(1-glycosyl)A(1-O-cyc,2-carboxylicacid)EF(2-cyc)EEEEIE | S. aureofaciens
geldanamycin | QULRMRnM | QUL(2-methoxy)RMS(1-CONH2,2-methoxy)nM | S. hygroscopicus var. geldanus
gephyronicacid | RRTSRRM | |
gloeosporone | | ELLE(1-O^)I(2-cyc,1-hydroxyl)L | Colletotrichum gloeosporioides f. sp. jussiaea
GERI-155 | BNLGJDN | B(2-O-glycosyl)N(1,2-epoxy)LG(2-hydroxyl)JD(1-glycosyl)N | S. GERI-155
halomicin | QNCRCBNM | QNC(1-methoxy)R(1-C(=O)C)CBNM | Mic. halophytica
herbimycin | ALDMFdM | A(1-methoxy)L(2-methoxy)D(1-methoxy)MF(1-CONH2,2-methoxy)nM |
hygrolidin | CENMJC3MM | CE(1-O-macrocyc,2-methoxy)NMJCMMQ(keto-macrocyc) | S. hygroscopicus
faygrolidtoroxo | BNBCENMJCMN | BNHCE(1-O-macrocyc,2-methoxy)NMJCMN(2-methoxy)Q(keto-macrocyc) | S. griseus, S. hygroscopicus
inimimnyciTi | URRNSLMQQS | UR(1-O-macrocyc)KNS(1-glycosyl)LMR(1-cyc)=QS(2-cyc)Q(keto-macrocyc) |
juvenimicinA1-T1124A1-M4365A1 | BMNGEDF | BMNGFDF(includes an ethyl in pos 2) | Mic. chalcea
juvenimicinA2-T1124A2-M4365A2 | BMNGJDF | | Mic. chalcea
juvenimicinA4-T1124A4-M4365A4 | BMNGODF | | Mic. chalcea-Mic. capillata
juvenimicin-T1124 | BNNGJDF | | Mic. chalcea
kanchanamycin | NMSSSSSSSRL | |
lankamycin-kujimycin-landavamycin-A20338N2 | ADGJDD | | S. violaceoniger, S. spinichromogenes
leinamycin | QNNIMLN | |
leucanicidin | CENMKCMN | CE(2-methoxy)NMKCMN(2-methoxy) | S. halstedii
leucomycinA12-kitasomycinA12 | FNNCODF | | S. kitasatoensis
leucomycinA14-kitasomycinA14 | FNNCOFF | | S. kitasatoensis
leucomycinA3-josamycin-platenomycinA3-turimycinA5 | ENNAODF | | S. kitasatoensis, S. hygroscopicus, S. narbonensis, S. platensis
leucomycinA5-turimycinH4 | ENNCOFF | | S. kitasatoensis
lienomycin | SSSNSSSTSMLh3RR>JNNNNL | |
lucensomycin-etruscomycin-lucimycin-FJ1163-butylpimaricin | SNNNNSREEFNN | | Act. sp, S. lucensis, S. glaucus
L155175 | RSNMURmM | RS(2-methoxy)NMURmM |
L681110 | NMURMN | NMURMN(2-methoxy) |
Li Wlwl l/wUX lactenocin | BMNHODF | | K. pnetnn, B. subtly Shva..
macrocin-Y07625 | OMNGODF | | S. fradiae gs 16
maridomycml.. | ENNCOPF | | S. paltensis, S. cq)UCIlsiSy S. tHCCTiocliFoniogcnc s
maridomycin-turiiriyciiJBPS-B5050A-YL704- 5^ | ENNODF | | S. hygroscopicus- O. pialcQSlo* malvmiis
maf'liA'nYvr'i'n A iiiauicuijroiiLfV | MRMKNLRL | |
midecamycinA1-piawsnuiuyciiio i* | ENNCOFF | ENNCOF(2-methoxy)F | S. mycarofaciens
midecamycinA2-mvH#*nflTm/pin AO^ iiijrUwuai.iijf i/iiLf\z>^ SF837A2 | FNNDJEE | FNNDJE(2-hydroxy)E(1-C(=O)CC) | S. mycarofaciens
milbemycin | QQQMKNOO | |
fruKAla^UIIljClIl | o OXSJvo o U JvlN o In. SMRMRNLRSLL | |
mycinamicin | BNNGJDN | B(1-cyc)NNGJDN(2-cyc) |
TTivciTi iimi cin VT | BNNGJDN | | Micromonospora griseorubida sp.
mycinamicinXl 1 | BNNGLDN | | pyogenes, Corynebacterium
#mycolactone | SRMUSMSSL^SMNMMN | SRMUSMSSL.SSMNMMN |
mycolactoneA | SRMUSMSSL | SRMUSMSSL |
mycolactoneB | sSMNMMN | sSMNMMN |
myxovirescinA1 | QQEFLNNLLILKJ | |
myxovirescinA2 | QQEFLNNLLEJJ | |
myxovirescinB-megovalicinB | QQEFLNNLLILKM | |
myxovirescinC-C1 | QQEFLNNLLLLKJ | |
myxovirescinD | QQEFLNNLLLLKM | |
myxovirescinE | QQEFLNNLLILJ | |
myxovirescinF1 | QQEFIJNNIXLLKJ | |
myxovirescinF2 | QQEFLNNLLLLJJ | |
myxovirescinOl | QQEFLiNNLLLKJ | |
myxovirescinG1 | QQEFLNNIXLJJ | |
myxovirescinH1 | QQEFLNNLLILKJ | |
myxovirescinH2 | QQEFLNNLUUJ | |
myxovirescinL | QQEPLNNLULHJ | |
myxovirescinP1 | QQEFLNNLLE*LKJ | |
myxovirescinP2 | QQEFLNNLLILUJ | |
myxovirescinQ | QQEFLNNLLILKM | |
myxovirescinS | QQEFLNNLLILHJ | |
M4365G2 | 6MNOODF | | Streptoverticillium kitasatoensis, S. thermotolerans
nancimycin | QNDACBNM | | S. albovinaceus
neocopiamycin | NRSSSSSRLRSRN | |
niddamycin-F3463-BdesacetylcarbomycinB | FNNGOEF | FNNGOF(1-glycosyl,2-methoxy)F | S. aureus, S. hitea, B. subt
oligomycinA | QJNNJRGAGAN | | S. diastatochromogenes, S. chibaensis
oligomycinB | QINNKCHDHDN | | S. diastatochromogenes
oligomycinB-44homo | QJNNJRGRGAN | | S. bottropensis
oligomycinD | QJNNJBGAGAN | | S. arabicus, S. parvulus, S. rutgersensis, S. griseus, S. aureofaciens
ossamycin | QQQNLLKFDD | |
perimycin | JBNNNnNNNFCEFEEEEIE | JBNNNnNNNFCEFEEEEIE |
phenalamid | nCNNNNM | |
phenalamideA1-fenalaimdJ-102-C | JMCMNnNM | |
phenalamideA2-102-T | JMCMNNNM | |
phenalamideA3 | JMCMNNnM | |
phenalamideB | JMCJViNtlyNM | |
phenalamideC | JMCMM^NM | |
phthoramycin | OONLUSRRN | |
pikromycin | ANGJDH | |
prasmoD-L155175 | DNMJCMM | | S. hygroscopicus ma 5285; S. prasinus
protostreptovaricin | QDACDNM | |
PD118576A2 | ENMJCMN | | S. sp. wp 3913
PF1163 | EJLLQ | |
#quinolidomicin | SSLULSSNNNSISNNSSSUSRRQQRRSS | |
rapamycin | FGMEGJNNMEEKQQ | |
rhizopodin | RLQSNNSS | |
rifamycin | QOQNDACBNM | |
rimocidin | BNNNNFCEFELlA | |
rosamicin-Tepronucin | BMNGODF | |
rosaramicin | BMNGODF | | Micromonospora rosaria
rustmicin | QQQJMOG | QQQJMO(includes CGH)G |
rutamycin | QQJNNJBGAGAN | |
scytophycin | BFCEENQEMN | |
scytophycinB-E | BFCEQNOEMN | |
shuiiniycm | NNRSSSSSSRLRSR>3N | |
sorangicin | LKMFnENLNCFDFNQNNnn | LKMF(2-hydroxyl)nEl>njsrCFDFNONNim |
sorangicinA | QNLNF | |
sorangicinB | NLNF | |
sorangolideA | IXLLKBUMEAFM | | myxobacterium Sorangium cellulosum
soraphen | EUFJDFA | |
spiramycin | nNCJDF | |
staphcoccomycin-angolamycin-shincomycin | CMNGODF | |
stipiamide | JMUlNNnNM | |
tartrolonB2 | nNLFHE | nNLFHE | fragment
ODUaliOllUC/ | | J vjmjuyiL/rit' tz-ny oTOxy 1 ) |
thiazinotrienomycin | QLmRS>JNNS | |
tiacumicin | SMMRMSNM | |
tylosinC-macrocin | EMNHGDF | |
tylosin-A | EMNGODF | |
TAN-1323 | NMUREMM | |
TMC-34 | NRSSSSSSRLRSRN | |
venturicidinA | CKNFLMQQO | |
vicenistatin | LNMULRNN | |
yuazxaziiy^cixiB-vinistDniyciii-TAN1323C | QNMKCCMN | | S. sp. ch4]
zincophorin-griseochelin | KMCNIACC | | S. griseus

In another embodiment, the polyketide library includes the name of the polyketide, the
CHUCKLES string and a linearized representation of the structure. The linearized
representations of the CHUCKLES structures for erythromycin and epothilone are as follows:




An illustrative example of a polyketide library containing linearized representations of their
structures is found in Appendix C (deposited in the microfiche appendix).

EXAMPLE 3 

Alternative PKS Genes for Epothilone

This example illustrates the alignment and design of novel PKS genes for the target
epothilone. Epothilone is first converted into CHUCKLES string format and then read into the
MORPH program as a TARGET. The program then generates all possible alignments of library
modules and sorts the alignments to determine preferred combinations of modules for gene
construction and production of epothilone via a novel polyketide synthase gene.
The epothilone D structure above was first opened at the macrolactone ring closure
between the C-1-ketone and the C-15-oxygen. The monomer set shown in Figure 2 was then
matched against each of the successive pairs of macrocyclic backbone carbon atoms, starting
with C-2 and C-3, which match monomer E. The next two carbon atoms C-4 and C-5 match
monomer G with an additional post-synthetic methylation on C-4. C-6 and C-7 match monomer
D. C-8 and C-9 match monomer J. C-10 and C-11 match monomer L. C-12 and C-13 match
monomer M. C-14 and C-15, where C-15 has a hydroxy substitution (modified by thioesterase
to close the macrocycle), match monomer E. C-16 and C-17 match monomer M.

The rest of the molecule, a methyl-substituted thiazole moiety, does not match any of the 
monomers in the monomer set. This moiety corresponds to a malonyl CoA loading module and 
an NRPS module that together generate the methyl-substituted thiazole moiety. This moiety is 
thus omitted from the CHUCKLES string generated from this illustrative monomer set but can 
be added simply by adding a monomer to the set. The CHUCKLES string generated is 
EGDJLMEM, which is in the reverse order of biosynthesis. This sequence is then reversed to 
MEMLJDGE to yield a monomer sequence that matches the order of biosynthesis. The 
sequence is then annotated to account for the post-synthetic modifications as follows:
MEMLJDG(2-methyl)E.
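
The reversal and annotation steps can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes one character per monomer with no "=" double-bond markers in the string; annotation positions are 0-based in biosynthetic order.

```python
from typing import Dict


def to_biosynthetic_order(chuckles: str) -> str:
    """Reverse a string that was read off in the reverse order of biosynthesis."""
    return chuckles[::-1]


def annotate(chuckles: str, notes: Dict[int, str]) -> str:
    """Append '(note)' after the monomer at each 0-based position in notes."""
    return "".join(
        monomer + (f"({notes[i]})" if i in notes else "")
        for i, monomer in enumerate(chuckles)
    )


target = to_biosynthetic_order("EGDJLMEM")
annotated = annotate(target, {6: "2-methyl"})
print(target, annotated)  # prints MEMLJDGE MEMLJDG(2-methyl)E
```

The annotated form is for bookkeeping; the plain string MEMLJDGE is what is supplied to the program as the target sequence.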

This target sequence is provided to the MORPH program to generate all possible
combinations of modules in the CHUCKLES-encoded library that will yield the target
CHUCKLES. The valid combinations are then sorted in increasing order of non-native
inter-module interfaces. In one implementation, a MORPH run generated 3,452 valid sequences of
five inter-module interfaces. Of these, none contain fewer than five inter-module interfaces.
Some illustrative sample module combinations appear below. The combinations are shown
listing each monomer followed by a colon and the name of the polyketide(s) from which it is
derived, followed by a parenthetical showing the associated monomers in that polyketide.
Vertical lines represent modular junctions between two different polyketides.

Illustrative PKS Gene 1:

M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML)
L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE)
E:tedanolide(GEH)

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in
the order listed, the module from the 3-acetyl-4"-butyryltylosin PKS that corresponds to monomer
M, the module from the tedanolide PKS corresponding to monomer E, the modules from the
aldgamycin PKS corresponding to monomers M, L, J, and D, and the modules from the
tedanolide PKS corresponding to monomers G and E.
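
The module-combination notation can be checked mechanically. The Python sketch below parses a design written as monomer:source(context) tokens and confirms that the monomers spell the epothilone target; the regular expression and function name are assumptions of this sketch, and the design string is Illustrative PKS Gene 1 as read from the text above.

```python
import re
from typing import List, Tuple

# monomer letter, colon, source name (no parentheses or bars), context in parentheses
TOKEN = re.compile(r"([A-Za-z]):([^()|]+?)\(([A-Za-z]+)\)")


def parse_design(text: str) -> List[Tuple[str, str, str]]:
    """Return (monomer, source polyketide, context monomers) for each module."""
    return [(m.group(1), m.group(2).strip(), m.group(3)) for m in TOKEN.finditer(text)]


gene1 = (
    'M:3-acetyl-4"-butyryltylosin(FMN) | E:tedanolide(GEH) | M:aldgamycin(BML) '
    "L:aldgamycin(MLG) | J:aldgamycin(GJD) D:aldgamycin(JDL) | G:tedanolide(JGE) "
    "E:tedanolide(GEH)"
)
modules = parse_design(gene1)
target = "".join(monomer for monomer, _, _ in modules)
junctions = gene1.count("|")   # vertical lines mark the listed junctions
print(target, junctions)       # prints MEMLJDGE 4
```

In this design every context is a three-monomer window with the chosen monomer in the middle, which gives a quick sanity check on a transcribed design; contexts at the ends of a polyketide may be shorter.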

Illustrative PKS Gene 2:

M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME)
E:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(MEJ) |
M:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(LME) |
L:albocycline-M1-ingramycin-TA2407-cineromycinB-U28010-SR2077(BLM) |
J:erythromycin(GJD) D:erythromycin(JDD) |
G:tedanolide(JGE) E:tedanolide(GEH)

EXAMPLE 4

Alternative PKS Genes for 6-Deoxyerythronolide B

This example illustrates the alignment and design of novel PKS genes for the
erythromycin basic polyketide structure (6-dEB) using the MORPH program.
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For the 6-dEB structure above, the CEKJCKUBS string is generated by first opening Hhe 
macrolactone ring closure between the Ol-ketone and the C-IB- oxygen. Using flie monomer 
set and matching protocol described in Example 3, one generates the CHUCKI£S string 
DDJGDA, in the reverse order of biosynthesis. This sequence is then reversed to ADGJDD to- 
yield the monomer sequence that matches the order of biosynthesis. The sequence is then 
annotated to account for the post-synthetic modifications (erythromycin A) as follows A{Zr 
hydroxyl) DGJ(2-hydroxyip(l-glycosyip(l-glycosyl). 
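
The reversal step described above is a plain string reversal over single-letter monomer codes. The following is a minimal sketch, assuming each monomer is encoded as exactly one character; the function name reverse_monomers is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reverses a NUL-terminated monomer string in place,
 * converting a string written in reverse order of biosynthesis into one
 * written in the order of biosynthesis (or the other way around). */
void reverse_monomers(char *s)
{
    size_t n, i;
    char tmp;

    assert(s != NULL);
    n = strlen(s);
    for (i = 0; i < n / 2; i++) {
        tmp = s[i];
        s[i] = s[n - 1 - i];
        s[n - 1 - i] = tmp;
    }
}
```

Applied to DDJGDA, this produces ADGJDD. Annotations such as (2-hydroxyl) span several characters and would have to be attached after the reversal, as in the text.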

This target sequence is supplied to the MORPH program to generate all possible
combinations of modules in the CHUCKLES-encoded library. The valid combinations are then
sorted in increasing order of non-native inter-module interfaces. In one implementation, a
MORPH run generated 19,631 valid sequences of less than or equal to five inter-module
interfaces. Of these, 13,306 contain four inter-module interfaces, and 256 contain only three
inter-module interfaces. Some contain only two inter-module interfaces, and one contains only
one. Some illustrative sample module combinations follow.

Illustrative PKS Gene 1: 

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:mycinamicin(NGJ) J:mycinamicin(GJD)
D:mycinamicin(JDN) | D:amphotericinA(CDN)

Illustrative PKS Gene 1 thus comprises one or more open reading frames that encode, in
the order listed, the amphotericin PKS module corresponding to monomer A, the aldgamycin
PKS module corresponding to monomer D, the mycinamicin PKS modules corresponding to
monomers G, J, and D, and the amphotericin PKS module corresponding to monomer D.

Illustrative PKS Gene 2:

A:amphotericinA(EAL) | D:aldgamycin(JDL) | G:pikromycin(NGJ) J:pikromycin(GJD)
D:pikromycin(JDH) | D:aldgamycin(JDL)

Illustrative PKS Gene 3:

A:Iaiikamycm-kujimycin-lan(kvamycm-A20338N2(-^ Dzlankamycin-kujimycin- 
lan<kvamycin-A20338N2(ADG) G:lankamycin-la^^^ 
J:lankamycin-kujimydn4andavamycm-^ 
D:ossamycin(PDN) 

Illustrative PKS Gene 4: 

A:amphotericinA(EAL) | D:lankamycin-kujimycin-landavamycin-A20338N2(ADG)
G:lankamycin-kujimycin-landavamycin-A20338N2(DGJ) J:lankamycin-kujimycin-
landavamycin-A20338N2(GJD) | D:A82548A-cytovaricin(HDD) D:A82548A-
cytovaricin(DDN)

Illustrative PKS Gene 5: 

A:laiikamydn-kujimycin-landavamyciii-A20338N2(-AD) Dilankamycin-kujimycin- 
landavainycin-A20338N2(ADG) Glankamycin-kujimycin-laii^^ 
J:Iankamycin-lcujimy(±ai-landavamycin-^ Ddankamycin-kujimycin- 
landavamycin-A20338N2(JDD) D:laDkamycin-kujimycm^^ 
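
The ranking step of this example, sorting valid combinations in increasing order of non-native inter-module interfaces, can be sketched as follows. This is an illustration only: it assumes each combination is held as a string in the notation used above, with one '|' per junction, and the names count_junctions and sort_by_junctions are hypothetical. The usage comment in the Example 5 listing achieves the same ordering with the Unix sort utility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: counts the '|' junction markers in one combination. */
int count_junctions(const char *combo)
{
    int n = 0;

    assert(combo != NULL);
    for (; *combo != '\0'; combo++) {
        if (*combo == '|') n++;
    }
    return n;
}

/* qsort comparator over an array of (const char *) combinations. */
static int by_junctions(const void *a, const void *b)
{
    const char *ca = *(const char * const *)a;
    const char *cb = *(const char * const *)b;

    return count_junctions(ca) - count_junctions(cb);
}

/* Sorts the combinations so that those with the fewest junctions come first. */
void sort_by_junctions(const char **combos, size_t n)
{
    qsort(combos, n, sizeof combos[0], by_junctions);
}
```

Combinations with the fewest junctions come first, on the rationale stated in the text that fewer non-native interfaces are preferred. qsort is not a stable sort, so combinations with equal counts may appear in any order.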

EXAMPLE 5

Source Code:
#include <stdio.h>

/* ~siani/programs/morph/morph3.c

PURPOSE: To traverse recursively all the entries in PKS.lib,
generating all feasible combinations of PKS modules to make the TARGET (e.g., epothilone).

INPUT: libraryfile: tab-delimited CHUCKLES-coded polyketides file with the
following columns:
        1. polyketide name
        2. plain CHUCKLES
        3. annotated CHUCKLES (contains information about post-synthetic modifications)
        4. source organism;

    targetname: user-defined name (e.g., epoD);

    targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g., MEMLJDGE);

    X, Y, Z sets of wildcards: sets of monomers for particular positions
appearing in target sequence (the wildcards can be used for analoging the TARGET polyketide);

    hard-coded parameters which may be reset (requires recompiling):

        NBOUNDARY_CUTOFF determines the maximum number of
non-native inter-modular interfaces which are contained in the output (set to 5, but may be
increased or decreased); and

        RECURSION_COUNTER_CUTOFF specifies the number of
levels of recursion (defaults to 0, 1, 2) acceptable for the run. A large PKS library can cause
recursion that will greatly increase run time; because of the multi-directionality of the
alignments (using every library entry as a STARTER), there is typically no need to go beyond 2
levels of recursion.

OUTPUT:
    All combinations of modules that meet parameters set by user. Example
output from MEMLJDGE (epothilone D) using a subset of a PKS library is provided below.
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of
"pieces" that are needed to put together the PKS.

    Names of PKSs have been abbreviated to fit them in these comments.

HIT M:3atyl(FMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5
HIT M:albM1(LME) E:albM1(MEJ)| M:albM1(LME)| L:aldga(MLG)| J:aldga(GJD) D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5
HIT M:albM1(LME) E:albM1(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD) D:aldga(JDL)| G:3atyl(NGO)| E:albM1(MEJ)| 5
HIT M:albM1(LME) E:albM1(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD) D:aldga(JDL)| G:aldga(LGI)| E:albM1(MEJ)| 5
HIT M:albM1(LME) E:albM1(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD) D:aldga(JDL)| G:aldga(LGJ)| E:albM1(MEJ)| 5

USAGE:
    morph3 -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]

examples:
    # generate combinations that yield epothilone D
    % morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD
    % egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort
    % egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTERALIGN

    # generate combinations that yield epothilone D with a C13-hydroxyl
    % morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH
    % egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort
    % egrep ALIGN_TARGET oepoD-13OH > oepoD-13OH_STARTERALIGN

    # generate combinations that yield epothilone with the following wildcards (set 1)
    % morph3 -l PKS.lib -n epoD-set1 -t MEXYZDgE -x ABCD -y LEFIN -z JACGM > oepoD-set1
    % grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

    # generate combinations that yield epothilone with the following wildcards (set 2)
    % morph3 -l PKS.lib -n epoD-set2 -t MEXYZDgE -x JK -y EF -z JACGM > oepoD-set2
    % grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort

LIMITATIONS:
    This version does not handle intra-modular modifications/splitting because
morph is operating at the monomer level. Modifications could convert the CHUCKLES-encoded
strings into the corresponding and equivalent SMILES and then perform more complex
chemical analysis of the PKS molecular graphs.
    Currently, inter-modular double bonds are present in the library, but are ignored by the
morph program.
*/

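#include <stdio.h>
#include <stdlib.h>   /* malloc, exit */
#include <string.h>   /* strlen, strcpy */

/* ~siani/programs/morph/morph3.c
*/

#define TRUE 1
#define FALSE 0
#define DEBUG_MATCH FALSE
#define DEBUG_STARTER FALSE
#define DEBUG_ALIGN FALSE
#define DEBUG_RECURSE FALSE
#define DEBUG_WILDCARD FALSE
#define MAXLEN 80
#define MAX_TYL_LEN 6
#define MAX_EPO_LEN 6
#define MAXNAMELEN 160
#define MAX_LIB_ENTRIES 500
#define MAXWILD 3
#define MAXBUF 200
#define NBOUNDARY_CUTOFF 5
#define RECURSION_COUNTER_CUTOFF 2
#define STARTER_MINIMUM_ADJACENT_ALIGN 2
#define MINIMUM_ADJACENT_ALIGN 2
typedef struct lib {
    char name[MAXNAMELEN];
    char monomersequence[MAXNAMELEN];
    char annotatedsequence[MAXNAMELEN];
    char alignedsequence[MAXNAMELEN];
    char alignedPKSname[MAXLEN][MAXNAMELEN];
    int boundarytoright[MAXNAMELEN];
    int marked[MAXLEN];
    char context[MAXLEN][4];
    int recursion_tagged;
    int nboundary;
} LIB;

/* forward declarations, so that main can call the functions defined after it */
int get_library(char *libraryfile, LIB *library);
int recurse_through_the_library(int nfilledmax, LIB epotemp, LIB *epothilone,
        int nwildcard, char wildcards[MAXWILD][MAXLEN], int nlib, LIB *library,
        int *recursion_counter);
int maximal_adjacent_alignment(LIB *epothilone, int nwildcard,
        char wildcards[MAXWILD][MAXLEN], LIB *library, int ilib,
        int smallest_acceptable_piece);
int maximal_adjacent_alignment_and_dump(LIB *epothilone, int nwildcard,
        char wildcards[MAXWILD][MAXLEN], LIB *library, int ilib,
        int smallest_acceptable_piece);
int output_fresh_alignment(LIB *epotemp);
int dump_STARTER_align(LIB epothilone, int nwildcard,
        char wildcards[MAXWILD][MAXLEN]);
int reset_epotemp(LIB *epotemp, LIB epothilone);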

int main(int argc, char **argv)
{
    int ii=0, jj=0, kk=0, ll=0;
    int nlib=0;
    int ecount=0;
    int nfilled=0, nfilledmax=0;
    int epothilonelen=0;
    int nlargestpiece=0, tnlargestpiece=0;
    int mmpass=0;
    int lcount=0;
    int new_unmarked_entries_filled=0;
    int recursion_counter = 0;
    int nwildcard=0;
    int best_new_unmarked_entries_filled = 0;
    int smallest_acceptable_piece = 0;
    int current_nmarked=0, previous_nmarked=0;
    char *sptr, *eptr, *lptr, *bufptr;
    char *clibptr;
    char *libraryfile;
    char *targetsequence, *targetname;
    char buf[MAXBUF];
    char wildcards[MAXWILD][MAXLEN];
    FILE *libp;
    LIB epotemp;
    /* static: MAX_LIB_ENTRIES entries occupy several megabytes, which can
       exceed the default stack size if allocated as an automatic variable */
    static LIB library[MAX_LIB_ENTRIES];
    LIB epothilone;
    char *progname;
    char **filelist, **fileptr;

    libraryfile = "";
    targetsequence = "";
    targetname = "";
    for(ii=0; ii<MAXWILD; ii++) {
        for(jj=0; jj<MAXLEN; jj++) {
            wildcards[ii][jj] = '\0';
        }
    }

    /*** process arguments ***/
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv)));
    progname = *argv++;
    if(argc<2) {
        fprintf(stderr,"usage: %s -l libraryfile -n targetname -t targetsequence [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]\n",progname);
        exit(0);
    }
    while(argc-- > 1) {
        if(argv[0][0] == '-' && argv[0][1]) {
            /* handle option */
            ++(*argv); /* advance past the minus */
            switch(**argv) {
            case 'l': /* get library input filename (PKS.lib) */
                argv++; argc--;
                libraryfile = argv[0];
                fprintf(stderr,"-l: libraryfile=%s\n",libraryfile);
                break;
            case 'n': /* get target name string */
                argv++; argc--;
                targetname = argv[0];
                fprintf(stderr,"-t: targetname=%s\n",targetname);
                break;
            case 't': /* get target sequence string */
                argv++; argc--;
                targetsequence = argv[0];
                fprintf(stderr,"-t: targetsequence=%s\n",targetsequence);
                break;
            case 'x': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            case 'X': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'Y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'Z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            default:
                fprintf(stderr,"%s unknown option, ignored\n",*argv);
            }/*switch*/
        } else { /* a regular filename */
            *fileptr++ = *argv;
            *fileptr = NULL;
        }
        argv++;
    }/*while*/
    if(nwildcard>0) {
        for(ii=0; ii<nwildcard; ii++) {
            fprintf(stderr,"wildcards[%d]=%s\n",ii,wildcards[ii]);
            fprintf(stdout,"wildcards[%d]=%s\n",ii,wildcards[ii]);
        }
    }

    epothilone.nboundary = 0;
    for(ii=0; ii<MAXNAMELEN; ii++) {
        epothilone.name[ii] = '\0';
        epothilone.monomersequence[ii] = '\0';
        epothilone.alignedsequence[ii] = '\0';
        epothilone.boundarytoright[ii] = TRUE;
        for(jj=0; jj<MAXLEN; jj++) {
            epothilone.alignedPKSname[jj][ii] = '\0';
            epothilone.marked[jj] = FALSE;
            epothilone.context[jj][0] = '\0';
            epothilone.context[jj][1] = '\0';
            epothilone.context[jj][2] = '\0';
        }
    }
    strcpy(epothilone.name,targetname);
    strcpy(epothilone.monomersequence,targetsequence);
    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence);
    /* build the (previous, current, next) context of each target monomer;
       '-' marks the start and the end of the chain */
    ecount=0;
    eptr = epothilone.monomersequence;
    while(*eptr!='\0') {
        if(ecount == 0) {
            epothilone.context[ecount][0] = '-';
            epothilone.context[ecount][1] = *eptr;
            epothilone.context[ecount][2] = *(eptr + 1);
        } else {
            if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                epothilone.context[ecount][0] = *(eptr-1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = '-';
            } else {
                epothilone.context[ecount][0] = *(eptr-1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr+1);
            }
        }
        epothilone.context[ecount][3] = '\0';
        eptr++;
        ecount++;
    }
    for(ii=0; ii<ecount; ii++) {
        fprintf(stdout,"(%s)\n",epothilone.context[ii]);
    }

    /* library */
    nlib = get_library(libraryfile,library);
    fprintf(stdout,"nlib=%d\n",nlib);
    kk=0; while(kk<nlib) {
        /* zero out the epothilone entry with respect to a new alignment */
        for(ii=0; ii<MAXNAMELEN; ii++) {
            epothilone.alignedsequence[ii] = '\0';
            epothilone.boundarytoright[ii] = TRUE;
            for(jj=0; jj<MAXLEN; jj++) {
                epothilone.alignedPKSname[jj][ii] = '\0';
                epothilone.marked[jj] = FALSE;
            }
        }
        /* reset the context back to that in epothilone */
        ecount=0;
        eptr = epothilone.monomersequence;
        while(*eptr!='\0') {
            if(ecount == 0) {
                epothilone.context[ecount][0] = '-';
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            } else {
                if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                    epothilone.context[ecount][0] = *(eptr-1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = '-';
                } else {
                    epothilone.context[ecount][0] = *(eptr-1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = *(eptr+1);
                }
            }
            epothilone.context[ecount][3] = '\0';
            eptr++;
            ecount++;
        }
        /* align STARTER (current library entry) and epothilone */
        sptr = library[kk].monomersequence;
        lcount=0;
        while(*sptr!='\0') {
            fprintf(stdout,"library[%d].monomersequence[%d]=%c\n",
                kk,lcount,library[kk].monomersequence[lcount]);
            sptr++;
            lcount++;
        }
        /* Call maximal_adjacent_alignment until it no longer
           returns more than two adjacent modules. There is no reason to
           try to extract individual modules, because this is done as part
           of the recursive filling of spaces from the library.
        */
        smallest_acceptable_piece = 2;
        eptr = epothilone.monomersequence;
        fprintf(stdout, "ALIGN_TARGET: ");
        while(*eptr!='\0') {
            fprintf(stdout,"%c",*eptr);
            eptr++;
        }
        fprintf(stdout,"\n");
        fprintf(stderr,"aligning %d %s\n",kk, library[kk].name);
        best_new_unmarked_entries_filled = 0;
        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment_and_dump(&epothilone,nwildcard,wildcards,library,kk,
            smallest_acceptable_piece)) >= STARTER_MINIMUM_ADJACENT_ALIGN) {
            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled)
                best_new_unmarked_entries_filled = new_unmarked_entries_filled;
        }
        if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN: new_unmarked_entries_filled=%d\n",new_unmarked_entries_filled);
        epothilonelen = strlen(epothilone.monomersequence);
        for(ii=0; ii< epothilonelen; ii++) {
            if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN: found a best alignment between epo_monomer[%d]=%c in library[%d].name=%s\n",
                ii,epothilone.monomersequence[ii],kk,epothilone.alignedPKSname[ii]);
        }
        library[kk].recursion_tagged = TRUE;
        fprintf(stdout,"ALIGN_TARGET: \n");
        dump_STARTER_align(epothilone,nwildcard,wildcards);
        fprintf(stdout,"ALIGN_TARGET: \n");
        if(best_new_unmarked_entries_filled <= 1) {
            fprintf(stdout,"ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n",best_new_unmarked_entries_filled);
            fprintf(stdout,"ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n",
                kk,library[kk].name);
            library[kk].recursion_tagged = FALSE;
            kk++;
            continue;
        }
        /* - fill in the gaps from the library - */
        /* generate a fresh copy of epothilone in epotemp */
        epothilonelen = strlen(epothilone.monomersequence);
        nfilledmax = strlen(epothilone.monomersequence);
        fprintf(stdout, "nfilledmax=%d\n", nfilledmax);
        reset_epotemp(&epotemp,epothilone);
        nfilled = 0;
        for(ii=0; ii< epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
        if(DEBUG_STARTER) fprintf(stdout,"nfilled from STARTER=%d\n",nfilled);
        for(mmpass = 0; mmpass < nlib; mmpass++) {
            if(mmpass == kk) { continue; }
            reset_epotemp(&epotemp,epothilone);
            nfilled=0;
            for(ii=0; ii< epothilonelen; ii++) {
                if(epotemp.marked[ii] == TRUE) nfilled++; }
            if(nfilled >= nfilledmax) {
                output_fresh_alignment(&epotemp);
            } else {
                current_nmarked = nfilled;
                previous_nmarked = nfilled;
                smallest_acceptable_piece = 1;
                while((new_unmarked_entries_filled =
                    maximal_adjacent_alignment(&epotemp,nwildcard,wildcards,library,mmpass,
                    smallest_acceptable_piece)) >= MINIMUM_ADJACENT_ALIGN) {
                    current_nmarked += new_unmarked_entries_filled;
                    if(DEBUG_MATCH) fprintf(stdout,
                        "main: recursion_level=%d,mmpass=%d,previous_nmarked=%d, current_nmarked=%d\n",
                        recursion_counter,mmpass,previous_nmarked, current_nmarked);
                }
                nfilled = 0;
                for(ii=0; ii< epothilonelen; ii++) {
                    if(epotemp.marked[ii] == TRUE) nfilled++; }
                if(nfilled>=nfilledmax) {
                    output_fresh_alignment(&epotemp);
                    continue; /* no need to recurse */
                }
                if(DEBUG_MATCH) fprintf(stdout, "main: about to RECURSE: mmpass=%d\n",mmpass);
                library[mmpass].recursion_tagged = TRUE;
                recursion_counter++;
                recurse_through_the_library(nfilledmax,epotemp,&epothilone,nwildcard,wildcards,
                    nlib,library,&recursion_counter);
                library[mmpass].recursion_tagged = FALSE;
                recursion_counter--;
            }
        }
        library[kk].recursion_tagged = FALSE;
        kk++;
    }/*nlib*/
    return 0;
}/*main*/

int recurse_through_the_library(
    int nfilledmax,
    LIB epotemp,
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    int nlib,
    LIB *library,
    int *recursion_counter)
{
    int ii=0;
    int ecount=0,elen=0;
    int mmpass=0;
    int nfilled=0;
    int lcount=0;
    int previous_nmarked=0, current_nmarked=0;
    int smallest_acceptable_piece = 0;
    char *eptr;
    char *clibptr;
    char boundary[MAXNAMELEN];
    int new_unmarked_entries_filled=0;
    LIB epotemp_temp;

    if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, nlib=%d\n",*recursion_counter,nlib);
    elen = strlen(epotemp.monomersequence);
    nfilled = 0;
    for(ii=0; ii< elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
    previous_nmarked = nfilled;
    current_nmarked = nfilled;
    smallest_acceptable_piece = 1;
    if(nfilled>=nfilledmax) { return 1; }
    for (mmpass = 0; mmpass < nlib; mmpass++) {
        if(*recursion_counter>=RECURSION_COUNTER_CUTOFF) {
            return 1;
        }
        if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, mmpass=%d\n",*recursion_counter,mmpass);
        /* entries already used higher up in the recursion are skipped */
        if(library[mmpass].recursion_tagged == TRUE) {
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: library[%d].recursion_tagged=TRUE, skipping\n",mmpass);
            continue;
        }
        reset_epotemp(&epotemp_temp,epotemp);
        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        previous_nmarked = nfilled;
        current_nmarked = nfilled;
        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment(&epotemp_temp, nwildcard,wildcards,library,
            mmpass,smallest_acceptable_piece)) >= 1) {
            current_nmarked += new_unmarked_entries_filled;
            if(DEBUG_MATCH) fprintf(stdout, "RECURSE: recursion_level=%d, mmpass=%d,previous_nmarked=%d, current_nmarked=%d\n",
                *recursion_counter,mmpass,previous_nmarked, current_nmarked);
        }
        elen = strlen(epotemp_temp.monomersequence);
        nfilled=0;
        for(ii=0; ii< elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        if(nfilled >= nfilledmax) {
            output_fresh_alignment(&epotemp_temp);
            continue;
        }
        library[mmpass].recursion_tagged = TRUE;
        (*recursion_counter)++;
        recurse_through_the_library(nfilledmax,epotemp_temp,epothilone,nwildcard,wildcards,
            nlib,library,recursion_counter);
        library[mmpass].recursion_tagged = FALSE;
        (*recursion_counter)--;
    }/*mmpass*/
    return 1;
}/*recurse_through_the_library*/
/*
    PURPOSE:
    INPUT:
    OUTPUT:
        returns the size of the largest maximal adjacent set of
        monomers inserted.
    PROCEDURE:
*/
int maximal_adjacent_alignment(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }
    if(DEBUG_ALIGN) fprintf(stdout,"maximal_adjacent_alignment: smallest_acceptable_piece=%d\n",smallest_acceptable_piece);
    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;
    while(*eptr!='\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;
        while(*sptr!='\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'X') { wptr = wildcards[0]; }
                else if(*eptr == 'Y') { wptr = wildcards[1]; }
                else if(*eptr == 'Z') { wptr = wildcards[2]; }
                while(*wptr!='\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece,
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN)
                            fprintf(stdout, "FOUND a largest piece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                                nlargestpiece,
                                largestpiece_ecount, *largestpiece_eptr,
                                ilib,library[ilib].name,largestpiece_lcount, *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece=0;
                    sptr++;
                    lcount++;
                    /*NEW*/
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece=0;
                break;
            }
        }
        tnlargestpiece=0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
    }
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);
    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            lcount++;
            ecount++;
        }
    }
    return (nlargestpiece);
}/*maximal_adjacent_alignment*/
/*
    PURPOSE:
    INPUT:
    OUTPUT:
        returns the size of the largest maximal adjacent set of
        monomers inserted.
    PROCEDURE:
*/
int maximal_adjacent_alignment_and_dump(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int elen=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0, hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }
    fprintf(stdout,"maximal_adjacent_alignment_and_dump: smallest_acceptable_piece=%d\n",smallest_acceptable_piece);
    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    elen = strlen(epothilone->monomersequence);
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;
    while (*eptr!='\0') {
        sptr = library[ilib].monomersequence;
        lcount = 0;
        /*NEW*/
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;
        while(*sptr!='\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";
                if(*eptr == 'X') { wptr = wildcards[0]; }
                else if(*eptr == 'Y') { wptr = wildcards[1]; }
                else if(*eptr == 'Z') { wptr = wildcards[2]; }
                while(*wptr!='\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }
                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout,
                        "FOUND a match: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece,
                        ecount, *eptr, ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN)
                            fprintf(stdout, "FOUND a largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n",
                                nlargestpiece,
                                largestpiece_ecount, *largestpiece_eptr, largestpiece_lcount,
                                *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }
        tnlargestpiece=0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
        if(DEBUG_ALIGN) {
            fprintf(stdout,"incrementing hold_this_place_eptr=%c, hold_this_ecount=%d\n",
                *hold_this_place_eptr,hold_this_ecount);
        }
    }
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);
    if(nlargestpiece >= smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        /* print the incorporated piece under the target, padded with blanks */
        fprintf(stdout,"ALIGN_TARGET: ");
        for(ii=0; ii<largestpiece_ecount; ii++) {
            fprintf(stdout, " ");
        }
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            fprintf(stdout,"%c",library[ilib].monomersequence[lcount]);
            lcount++;
            ecount++;
        }
        for(ii=ecount; ii<elen; ii++) {
            fprintf(stdout," ");
        }
        fprintf(stdout," %s\n",library[ilib].name);
    }
    return (nlargestpiece);
}/*maximal_adjacent_alignment_and_dump*/
int output_fresh_alignment(
    LIB *epotemp)
{
    int ecount=0;
    char *eptr;
    char boundary[MAXNAMELEN];

    eptr = epotemp->monomersequence;
    ecount = 0;
    epotemp->nboundary = 0;
    strcpy(boundary,epotemp->alignedPKSname[ecount]);
    /* count the pieces: every position flagged as having a boundary to its right */
    while(*eptr!='\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            epotemp->nboundary++;
        }
        ecount++;
        eptr++;
    }
    if(epotemp->nboundary > NBOUNDARY_CUTOFF) return 1;
    eptr = epotemp->monomersequence;
    ecount = 0;
    fprintf(stdout,"HIT ");
    while(*eptr!='\0') {
        if(epotemp->alignedPKSname[ecount][0] == '\0') {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:TARG(%s)| ",*eptr,epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:TARG(%s) ",*eptr,epotemp->context[ecount]);
            }
        } else {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:%4s(%s)| ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:%4s(%s) ",*eptr,epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            }
        }
        ecount++;
        eptr++;
    }
    fprintf(stdout,"npiece %d",epotemp->nboundary);
    fprintf(stdout,"\n");
    return 1;
}/*output_fresh_alignment*/
int get_library(
    char *libraryfile,
    LIB *library)
{
    int ii=0,jj=0,kk=0,lcount=0;
    int nlib=0;
    char *bufptr,buf[MAXBUF];
    char tmonomersequence[MAXNAMELEN];
    char *lptr,*tptr;
    FILE *libp;

    for(kk=0; kk<MAX_LIB_ENTRIES; kk++) {
        library[kk].recursion_tagged = FALSE;
        for(ii=0; ii<MAXNAMELEN; ii++) {
            library[kk].name[ii] = '\0';
            library[kk].monomersequence[ii] = '\0';
            library[kk].alignedsequence[ii] = '\0';
            for(jj=0; jj<MAXLEN; jj++) {
                library[kk].alignedPKSname[jj][ii] = '\0';
                library[kk].marked[jj] = FALSE;
                library[kk].context[jj][0] = '\0';
                library[kk].context[jj][1] = '\0';
                library[kk].context[jj][2] = '\0';
                library[kk].context[jj][3] = '\0';
            }
        }
        library[kk].nboundary = 0;
    }
    /* read in the library from PKS.lib */
    if(NULL==(libp=fopen(libraryfile,"r"))) {
        fprintf(stdout,"TRY AGAIN: couldn't open %s\n",libraryfile);
        nlib=0;
        exit(0);
    }
    nlib=0;
    while(nlib<MAX_LIB_ENTRIES) {
        if(NULL==fgets(buf,sizeof(buf),libp)) break;
        bufptr=buf;
        if(*bufptr == '#') continue;
        if(*bufptr != '\n') {
            /* column 1: polyketide name */
            tptr = library[nlib].name;
            while((*bufptr!='\t') && (*bufptr!='\0') &&
                (*bufptr!='\n')) {
                *tptr++ = *bufptr++;
            }
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;
            /* column 2: plain CHUCKLES */
            lptr = library[nlib].monomersequence;
            while((*bufptr!='\t') && (*bufptr!='\0') &&
                (*bufptr!='\n')) {
                /* This code specifically deletes inter-modular
                   double bonds, optional.
                   The double-bond character is illegible in the source
                   listing; '=' (the SMILES double-bond symbol) is an
                   assumption made here. */
                if(*bufptr!='=') {
                    *lptr++ = *bufptr++;
                } else {
                    bufptr++;
                }
            }
            *lptr='\0';
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;
            /* column 3: annotated CHUCKLES */
            lptr = library[nlib].annotatedsequence;
            while((*bufptr!='\t') && (*bufptr!='\0') &&
                (*bufptr!='\n')) {
                /* the loop body is missing from the source listing;
                   a plain copy is assumed, as for the other columns */
                *lptr++ = *bufptr++;
            }
            *lptr = '\0';
            if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;
            fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib].monomersequence);
            fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib].annotatedsequence);
            nlib++;
        }
    }
    fclose(libp);
    /* build the (previous, current, next) context of every library monomer */
    for(kk=0; kk<nlib; kk++) {
        lcount=0;
        lptr = library[kk].monomersequence;
        while(*lptr!='\0') {
            if(lcount == 0) {
                library[kk].context[lcount][0] = '-';
                library[kk].context[lcount][1] = *lptr;
                library[kk].context[lcount][2] = *(lptr + 1);
            } else {
                if(lcount ==
                    (strlen(library[kk].monomersequence) - 1)) {
                    library[kk].context[lcount][0] = *(lptr-1);
                    library[kk].context[lcount][1] = *lptr;
                    library[kk].context[lcount][2] = '-';
                } else {
                    library[kk].context[lcount][0] = *(lptr-1);
                    library[kk].context[lcount][1] = *lptr;
                    library[kk].context[lcount][2] = *(lptr+1);
                }
            }
            library[kk].context[lcount][3] = '\0';
            lptr++;
            lcount++;
        }
        fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].monomersequence);
        fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,library[kk].annotatedsequence);
        for(jj=0; jj< strlen(library[kk].monomersequence); jj++) {
            fprintf(stdout,"(%s)",library[kk].context[jj]);
        }
        fprintf(stdout,"\n");
    }
    return nlib;
}/*get_library*/
int dump_STARTER_align(
    LIB epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN]
    )
{
    int elen=0;
    int ecount=0,hold_ecount=0;
    int wildcardmatch=FALSE;
    char *sptr,*eptr,*wptr;

    elen = strlen(epothilone.monomersequence);
    /* line 1: the target sequence */
    eptr = epothilone.monomersequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr!='\0') {
        fprintf(stdout,"%c",*eptr);
        eptr++;
    }
    fprintf(stdout,"\n");
    ecount=0;
    /* line 2: match markers (':' for a wildcard match; the exact-match
       marker is illegible in the source listing and '|' is assumed) */
    eptr = epothilone.monomersequence;
    sptr = epothilone.alignedsequence;
    fprintf(stdout, "ALIGN_TARGET: ");
    while(*eptr != '\0') {
        wildcardmatch = FALSE;
        wptr = "";   /* no wildcard set unless the target monomer is X, Y or Z */
        if(*eptr == 'X') { wptr = wildcards[0]; }
        else if(*eptr == 'Y') { wptr = wildcards[1]; }
        else if(*eptr == 'Z') { wptr = wildcards[2]; }
        while(*wptr!='\0') {
            if(*wptr == *sptr) {
                wildcardmatch = TRUE;
                break;
            }
            wptr++;
        }
        if(wildcardmatch==TRUE) {
            fprintf(stdout,":");
        } else {
            if(*eptr == *sptr) {
                fprintf(stdout,"|");
            } else {
                fprintf(stdout," ");
            }
        }
        eptr++;
        sptr++;
    }
    fprintf(stdout,"\n");
    /* line 3: the aligned STARTER monomers and the STARTER name */
    fprintf(stdout,"ALIGN_TARGET: ");
    eptr = epothilone.monomersequence;
    ecount=0;
    while(*eptr!='\0') {
        if(epothilone.alignedsequence[ecount] == '\0') {
            fprintf(stdout," ");
        } else {
            fprintf(stdout,"%c",epothilone.alignedsequence[ecount]);
            hold_ecount = ecount;
        }
        eptr++;
        ecount++;
    }
    fprintf(stdout," %s\n",epothilone.alignedPKSname[hold_ecount]);
    fprintf(stdout,"STARTER_ALIGN:\n");
    return 1;
}/*dump_STARTER_align*/

reset_epotemp(
    LIB *epotemp,
    LIB epothilone)
{
    int elen=0;
    int jj=0;

    /* copy the current state of the target alignment into epotemp */
    elen = strlen(epothilone.monomersequence);
    strcpy(epotemp->name,epothilone.name);
    strcpy(epotemp->monomersequence,epothilone.monomersequence);
    strcpy(epotemp->alignedsequence,epothilone.alignedsequence);
    epotemp->nboundary = 0;
    for(jj=0; jj<elen; jj++) {
        strcpy(epotemp->alignedPKSname[jj],epothilone.alignedPKSname[jj]);
        epotemp->marked[jj] = epothilone.marked[jj];
        strcpy(epotemp->context[jj],epothilone.context[jj]);
        epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];
    }
}/*reset_epotemp*/

EXAMPLE 6

Source Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* '^aiii/programs/morph/morph4.c

PURPOSE: To recursively traverse all the entries in PKS.lib, generating all feasible
combinations of PKS modules to make the TARGET (e.g., epothilone).

INPUT: -b number_boundary_cutoff: lets the user set the maximum number of
boundaries in output lines. This defaults to 5 (#define NBOUNDARY_CUTOFF 5), which is a
reasonable assumption for something of the length of epothilone (8 modules). However, when
looking at discodermolide, which has 11 modules, a cutoff of 5 sometimes results in too few
output lines; it is too restrictive.

-d allows one to ignore the inter-modular double bonds in the library file.

-l libraryfile: tab-delimited CHUCKLES-coded polyketides file with the following
columns

1. polyketide name
2. plain CHUCKLES
3. annotated CHUCKLES (contains information about post-synthetic modifications)
4. source organism

-n targetname: user-defined name (e.g., epoD)

-t targetsequence: CHUCKLES-coded polyketide of desired TARGET (e.g.,
MEMLJDGE)

-w, -x, -y, -z sets of wildcards: sets of monomers for particular positions appearing in
targetsequence. The wildcards can effectively be used for analoging the TARGET polyketide.

Hard-coded parameters which may be reset (requires recompiling):
#define NBOUNDARY_CUTOFF 5

NBOUNDARY_CUTOFF determines the maximum number of non-native inter-modular
interfaces which are contained in an output line. This is now set to 5, but may be increased
when the user does not care about inefficiencies introduced by these interfaces or when the
targetsequence is very lengthy.

#define RECURSION_COUNTER_CUTOFF 2

RECURSION_COUNTER_CUTOFF specifies the number of levels of recursion
(defaults to 0, 1, 2) acceptable for the run. This limit must be set since the large PKS library can
result in recursion that will combinatorially explode. Because of the multi-directionality of the
alignments (using every library entry as a STARTER), there is no need to go beyond 2 levels of
recursion. However, there may be cases in the future where this number should be increased.
Note that while recursion will eventually terminate without this parameter, runs with a library
over about 20 PKS entries may run for years on a reasonably fast computer.

OUTPUT: All combinations of modules that meet parameters set by user. 

Example output from MEMLJDGE (epothilone D) using subset of PKS.lib.
Vertical bars indicate non-native inter-modular interfaces. Last column contains the number of
"pieces" that are needed to put together the PKS.

Names of PKSs have been abbreviated to fit them in these comments.
HIT M:3atyl(FMN)| E:tedan(GEH)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:albMl(LME)| L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:tedan(JGE) E:tedan(GEH)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:3atyl(NG0)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

HIT M:albMl(LME) E:albMl(MEJ)| M:aldga(BML) L:aldga(MLG)| J:aldga(GJD)
D:aldga(JDL)| G:aldga(LGJ)| E:albMl(MEJ)| 5

USAGE: 

morph3 -l libraryfile -n targetname -t targetsequence [-w W-wildcards] [-x X-wildcards]
[-y Y-wildcards] [-z Z-wildcards] -d

samples:

# generate combinations that yield epothilone D
%morph3 -l PKS.lib -n epoD -t MEMLJDGE > omorph3_epoD
%egrep HIT omorph3_epoD | sort | uniq | sort +10 -11 > omorph3_epoD.uniq.sort
%egrep ALIGN_TARGET omorph3_epoD > omorph3_epoD_STARTER_ALIGN

# generate combinations that yield epothilone D with a C13-hydroxyl
%morph3 -l PKS.lib -n epoD-13OH -t MEXLJDGE -x ABCD > oepoD-13OH
%egrep HIT oepoD-13OH | sort | uniq | sort +10 -11 > oepoD-13OH.uniq.sort
%egrep ALIGN_TARGET oepoD-13OH > oepoD-13OH_STARTER_ALIGN

# generate combinations that yield epothilone with the following wildcards (set 1)
%morph3 -l PKS.lib -n epoD-set1 -t MEXYZDGE -x ABCD -y LEFIN -z JACGM > oepoD-set1
%grep HIT oepoD-set1 | sort | uniq | sort +10 -11 > oepoD-set1.uniq.sort

# generate combinations that yield epothilone with the following wildcards (set 2)
%morph3 -l PKS.lib -n epoD-set2 -t MEXYZDGE -x JK -y EF -z JACGM > oepoD-set2
%grep HIT oepoD-set2 | sort | uniq | sort +10 -11 > oepoD-set2.uniq.sort
LIMITATIONS: 

Current implementation cannot handle intra-modular modifications/splitting 
because morph is operating at the monomer level. Future implementations could convert the
CHUCKLES-encoded strings into the corresponding and equivalent SMILES and then perform
more complex chemical analysis of the PKS molecular graphs. Currently, inter-modular double
bonds are present in the library, but are ignored by the morph program.

MODIFICATIONS: 

+ added ability to include user-defined wildcards (X, Y, or Z) on the
command line. MAS 05-16-00.
+ added additional wildcard (W). MAS 05-30-00.
+ added additional (summary) column to HIT output list. MAS 05-30-00.
+ added command line argument for suppressing the inter-modular double bonds
in the library. Default is not to treat these as separate modules. MAS 05-31-00.
+ added column that contains the length of the largest matching fragment. MAS
06-05-00.
*/ 
#define TRUE 1
#define FALSE 0
#define DEBUG_MATCH FALSE
#define DEBUG_STARTER FALSE
#define DEBUG_ALIGN FALSE
#define DEBUG_RECURSE FALSE
#define DEBUG_WILDCARD FALSE

#define MAXLEN 80
#define MAX_TYL_LEN 6
#define MAX_EPO_LEN 6
#define MAXNAMELEN 500
#define MAX_LIB_ENTRIES 500
#define MAXWILD 4
#define MAXBUF 1000

#define NBOUNDARY_CUTOFF 5
#define RECURSION_COUNTER_CUTOFF 2
#define STARTER_MINIMUM_ADJACENT_ALIGN 2
#define MINIMUM_ADJACENT_ALIGN 2

typedef struct _lib {
    char name[MAXNAMELEN];
    char monomersequence[MAXNAMELEN];
    char annotatedsequence[MAXNAMELEN];
    char alignedsequence[MAXNAMELEN];
    char alignedPKSname[MAXLEN][MAXNAMELEN];
    int boundarytoright[MAXNAMELEN];
    int marked[MAXLEN];
    char context[MAXLEN][4];
    int recursion_tagged;
    int nboundary;
} LIB;
int main(int argc, char **argv)
{
    int ii=0,jj=0,kk=0,ll=0;
    int nlib=0;
    int ecount=0;
    int nfilled=0,nfilledmax=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int mmpass=0;
    int lcount=0;
    int new_unmarked_entries_filled=0;
    int recursion_counter = 0;
    int nwildcard=0;
    int best_new_unmarked_entries_filled = 0;
    int smallest_acceptable_piece = 0;
    int current_nmarked=0,previous_nmarked=0;
    int inter_modular_db_flag_off = FALSE;
    int nboundary_cutoff=NBOUNDARY_CUTOFF;
    char *sptr, *eptr, *lptr,*bufptr;
    char *chbptr;
    char *libraryfile;
    char *targetsequence,*targetname;
    char buf[MAXBUF];
    char wildcards[MAXWILD][MAXLEN];
    FILE *libp;
    LIB epotemp;
    LIB library[MAX_LIB_ENTRIES];
    LIB epothilone;

    char *progname;
    char **filelist, **fileptr;

    libraryfile = "";
    targetsequence = "";
    targetname = "";

    for(ii=0; ii<MAXWILD; ii++) {
        for(jj=0; jj<MAXLEN; jj++) {
            wildcards[ii][jj] = '\0';
        }
    }

    /* process arguments */
    filelist = fileptr = (char **)(malloc(argc * sizeof(*argv)));

    progname = *argv++;
    if(argc<2) {
        fprintf(stderr,"usage:%s [-b nboundary_cutoff] [-d] -l libraryfile -n targetname -t "
            "targetsequence [-w W-wildcards] [-x X-wildcards] [-y Y-wildcards] [-z Z-wildcards]"
            "\n",progname);
        exit(0);
    }

    while(argc-- > 1) {
        if(argv[0][0] == '-' && argv[0][1] != '\0') {
            /* handle option */
            ++(*argv); /* advance past the minus */
            switch(**argv) {
            case 'b': /* get number of boundaries cutoff for output of alignments */
                argv++; argc--;
                sscanf(argv[0],"%d",&nboundary_cutoff);
                fprintf(stderr,"-b: nboundary_cutoff=%d\n",nboundary_cutoff);
                break;
            case 'd': /* ignore inter-modular double bonds in the library file */
                inter_modular_db_flag_off = TRUE;
                fprintf(stderr,"-d: inter-modular double bonds ignored.\n");
                break;
            case 'l': /* get library input filename (PKS.lib) */
                argv++; argc--;
                libraryfile = argv[0];
                fprintf(stderr,"-l: libraryfile=%s\n",libraryfile);
                break;
            case 'n': /* get target name string */
                argv++; argc--;
                targetname = argv[0];
                fprintf(stderr,"-n: targetname=%s\n",targetname);
                break;
            case 't': /* get target sequence string */
                argv++; argc--;
                targetsequence = argv[0];
                fprintf(stderr,"-t: targetsequence=%s\n",targetsequence);
                break;
            case 'w': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-w: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'x': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-y: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            case 'z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3],argv[0]);
                fprintf(stderr,"-z: wildcards[%d]=%s\n",3,wildcards[3]);
                nwildcard++;
                break;
            case 'W': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[0],argv[0]);
                fprintf(stderr,"-w: wildcards[%d]=%s\n",0,wildcards[0]);
                nwildcard++;
                break;
            case 'X': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[1],argv[0]);
                fprintf(stderr,"-x: wildcards[%d]=%s\n",1,wildcards[1]);
                nwildcard++;
                break;
            case 'Y': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[2],argv[0]);
                fprintf(stderr,"-y: wildcards[%d]=%s\n",2,wildcards[2]);
                nwildcard++;
                break;
            case 'Z': /* get a wildcard string */
                argv++; argc--;
                strcpy(wildcards[3],argv[0]);
                fprintf(stderr,"-z: wildcards[%d]=%s\n",3,wildcards[3]);
                nwildcard++;
                break;
            default:
                fprintf(stderr,"%s unknown option; ignored\n",*argv);
            }/*switch*/
        } else { /* a regular filename */
            *fileptr++ = *argv;
            *fileptr = NULL;
        }
        argv++; /* advance to the next argument */
    }/*while*/



    if(nwildcard>0) {
        for(ii=0; ii<nwildcard; ii++) {
            fprintf(stderr,"wildcards[%d]=%s\n",ii,wildcards[ii]);
            fprintf(stdout,"wildcards[%d]=%s\n",ii,wildcards[ii]);
        }
    }

    epothilone.nboundary = 0;
    for(ii=0; ii<MAXNAMELEN; ii++) {
        epothilone.name[ii] = '\0';
        epothilone.monomersequence[ii] = '\0';
        epothilone.alignedsequence[ii] = '\0';
        epothilone.boundarytoright[ii] = TRUE;
        for(jj=0; jj<MAXLEN; jj++) {
            epothilone.alignedPKSname[jj][ii] = '\0';
            epothilone.marked[jj] = FALSE;
            epothilone.context[jj][0] = '\0';
            epothilone.context[jj][1] = '\0';
            epothilone.context[jj][2] = '\0';
        }
    }

    strcpy(epothilone.name,targetname);
    strcpy(epothilone.monomersequence,targetsequence);

    fprintf(stdout, "TARGET: %s\n", epothilone.monomersequence);

    /* record each target monomer together with its left and right neighbors */
    ecount = 0;
    eptr = epothilone.monomersequence;
    while(*eptr != '\0') {
        if(ecount == 0) {
            /* no left neighbor; '-' assumed, as at the right end */
            epothilone.context[ecount][0] = '-';
            epothilone.context[ecount][1] = *eptr;
            epothilone.context[ecount][2] = *(eptr + 1);
        } else {
            if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = '-';
            } else {
                epothilone.context[ecount][0] = *(eptr - 1);
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            }
        }
        epothilone.context[ecount][3] = '\0';
        eptr++;
        ecount++;
    }

    for(ii=0; ii<ecount; ii++) {
        fprintf(stdout,"(%s)\n",epothilone.context[ii]);
    }

    /* library */
    nlib = get_library(libraryfile,library,inter_modular_db_flag_off);
    fprintf(stdout,"nlib=%d\n",nlib);



    kk=0; while(kk<nlib) {
        /* zero out the epothilone entry with respect to a new alignment */
        for(ii=0; ii<MAXNAMELEN; ii++) {
            epothilone.alignedsequence[ii] = '\0';
            epothilone.boundarytoright[ii] = TRUE;
            for(jj=0; jj<MAXLEN; jj++) {
                epothilone.alignedPKSname[jj][ii] = '\0';
                epothilone.marked[jj] = FALSE;
            }
        }

        /* reset the context back to that in epothilone */
        ecount=0;
        eptr = epothilone.monomersequence;
        while(*eptr != '\0') {
            if(ecount==0) {
                /* no left neighbor; '-' assumed, as at the right end */
                epothilone.context[ecount][0] = '-';
                epothilone.context[ecount][1] = *eptr;
                epothilone.context[ecount][2] = *(eptr + 1);
            } else {
                if(ecount == (strlen(epothilone.monomersequence) - 1)) {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = '-';
                } else {
                    epothilone.context[ecount][0] = *(eptr - 1);
                    epothilone.context[ecount][1] = *eptr;
                    epothilone.context[ecount][2] = *(eptr + 1);
                }
            }
            epothilone.context[ecount][3] = '\0';
            eptr++;
            ecount++;
        }

        /* align STARTER (current library entry) and epothilone */
        sptr = library[kk].monomersequence;
        lcount=0;
        while(*sptr != '\0') {
            fprintf(stdout,"library[%d].monomersequence[%d]=%c\n",
                kk,lcount,library[kk].monomersequence[lcount]);
            sptr++;
            lcount++;
        }

        /* Call maximal_adjacent_alignment until it no longer returns more than
           two adjacent modules. There is really no reason to try to extract
           individual modules because this will be done as part of the
           recursive filling of spaces from the library.
        */

        smallest_acceptable_piece = 2;
        eptr = epothilone.monomersequence;
        fprintf(stdout, "ALIGN_TARGET: ");
        while(*eptr != '\0') {
            fprintf(stdout,"%c",*eptr);
            eptr++;
        }
        fprintf(stdout,"\n");

        fprintf(stderr,"aligning %d %s\n",kk, library[kk].name);

        best_new_unmarked_entries_filled = 0;

        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment_and_dump(&epothilone,nwildcard,wildcards,library,kk,
                smallest_acceptable_piece)) >= STARTER_MINIMUM_ADJACENT_ALIGN) {

            if(best_new_unmarked_entries_filled < new_unmarked_entries_filled) {
                best_new_unmarked_entries_filled = new_unmarked_entries_filled;
            }

            if(DEBUG_STARTER) fprintf(stdout, "STARTER ALIGN: new_unmarked_entries_filled=%d\n",
                new_unmarked_entries_filled);

            epothilonelen = strlen(epothilone.monomersequence);
            for(ii=0; ii<epothilonelen; ii++) {
                if(DEBUG_STARTER) fprintf(stdout,
                    "a best alignment between epo.monomer[%d]=%c in library[%d].name=%s\n",
                    ii,epothilone.monomersequence[ii],kk,epothilone.alignedPKSname[ii]);
            }
        }

        library[kk].recursion_tagged = TRUE;

        fprintf(stdout,"ALIGN_TARGET: \n");
        dump_STARTER_align(epothilone,nwildcard,wildcards);
        fprintf(stdout,"ALIGN_TARGET: \n");

        if(best_new_unmarked_entries_filled <= 1) {
            fprintf(stdout,"ALIGN_TARGET: PROBLEM best_new_unmarked_entries_filled = %d\n",
                best_new_unmarked_entries_filled);
            fprintf(stdout,"ALIGN_TARGET: PROBLEM skipping this STARTER entry for library[%d].name=%s\n",
                kk,library[kk].name);
            library[kk].recursion_tagged = FALSE;
            kk++;
            continue;
        }



        /* fill in the gaps from the library */

        /* generate a fresh copy of epothilone in epotemp */
        reset_epotemp(&epotemp,epothilone);
        epothilonelen = strlen(epothilone.monomersequence);
        nfilledmax = strlen(epothilone.monomersequence);
        fprintf(stdout, "nfilledmax=%d\n", nfilledmax);
        nfilled = 0;
        for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
        if(DEBUG_STARTER) fprintf(stdout,"nfilled from STARTER=%d\n",nfilled);

        for(mmpass = 0; mmpass < nlib; mmpass++) {
            if(mmpass == kk) { continue; }
            reset_epotemp(&epotemp,epothilone);

            nfilled=0;
            for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }

            if(nfilled >= nfilledmax) {
                output_fresh_alignment(&epotemp,nboundary_cutoff);
            } else {
                current_nmarked = nfilled;
                previous_nmarked = nfilled;
                smallest_acceptable_piece = 2;
                while((new_unmarked_entries_filled =
                    maximal_adjacent_alignment(&epotemp,nwildcard,wildcards,library,mmpass,
                        smallest_acceptable_piece)) >= MINIMUM_ADJACENT_ALIGN) {
                    current_nmarked += new_unmarked_entries_filled;
                    if(DEBUG_MATCH) fprintf(stdout, "main: "
                        "recursion_level=%d, mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                        recursion_counter,mmpass,previous_nmarked,
                        current_nmarked);
                }

                nfilled=0;
                for(ii=0; ii<epothilonelen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }

                if(nfilled>=nfilledmax) {
                    output_fresh_alignment(&epotemp,nboundary_cutoff);
                    continue; /* no need to recurse */
                }

                if(DEBUG_MATCH) fprintf(stdout, "main: about to RECURSE: mmpass=%d\n",mmpass);

                library[mmpass].recursion_tagged = TRUE;
                recursion_counter++;

                recurse_through_the_library(nfilledmax,epotemp,&epothilone,nwildcard,wildcards,
                    nlib,library,&recursion_counter,nboundary_cutoff);

                library[mmpass].recursion_tagged = FALSE;
                recursion_counter--;
            }
        }

        library[kk].recursion_tagged = FALSE;
        kk++;
    }/*nlib*/

}/*main*/
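
The three-character "context" strings that main and get_library attach to every monomer (and that appear in parentheses in the HIT output) can be generated by a short routine. The following is a minimal sketch rather than the code of the listing: the function name build_contexts is hypothetical, and the use of '-' for a missing neighbor at the left end of the chain is an assumption carried over from the right-end case.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not in the listing): fill ctx[i] with the
   three-character context of monomer i in seq, i.e. left neighbor,
   the monomer itself, right neighbor. A '-' stands in for the missing
   neighbor at either end of the chain. ctx must have at least
   strlen(seq) rows. */
static void build_contexts(const char *seq, char ctx[][4])
{
    size_t len, i;

    assert(seq != NULL && ctx != NULL);
    len = strlen(seq);
    for (i = 0; i < len; i++) {
        ctx[i][0] = (i == 0) ? '-' : seq[i - 1];
        ctx[i][1] = seq[i];
        ctx[i][2] = (i + 1 == len) ? '-' : seq[i + 1];
        ctx[i][3] = '\0';
    }
}
```

For the three-module sequence MEM, this sketch produces the contexts -ME, MEM and EM-. When a library module is incorporated into the target, its library context replaces the target context, which is why the HIT lines show the neighbors the module has in its native PKS.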



/*
PURPOSE:
INPUT:
OUTPUT:
PROCEDURE:
*/

int
recurse_through_the_library(
    int nfilledmax,
    LIB epotemp,
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    int nlib,
    LIB *library,
    int *recursion_counter,
    int nboundary_cutoff)
{
    int ii;
    int ecount=0,elen=0;
    int mmpass=0;
    int nfilled=0;
    int lcount=0;
    int previous_nmarked=0, current_nmarked=0;
    int smallest_acceptable_piece = 0;
    char *eptr;
    char *clibptr;
    char boundary[MAXNAMELEN];
    int new_unmarked_entries_filled=0;
    LIB epotemp_temp;

    if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, nlib=%d\n",
        *recursion_counter,nlib);

    elen = strlen(epotemp.monomersequence);
    nfilled = 0;
    for(ii=0; ii<elen; ii++) { if(epotemp.marked[ii] == TRUE) nfilled++; }
    previous_nmarked = nfilled;
    current_nmarked = nfilled;
    smallest_acceptable_piece = 1;
    if(nfilled >= nfilledmax) { return 1; }

    for(mmpass = 0; mmpass < nlib; mmpass++) {
        /* stop descending once the recursion depth limit is reached */
        if(*recursion_counter >= RECURSION_COUNTER_CUTOFF) {
            return 1;
        }

        if(DEBUG_MATCH) fprintf(stdout,"RECURSE: recursion_counter=%d, mmpass=%d\n",
            *recursion_counter,mmpass);

        /* skip library entries already used higher up in the recursion */
        if(library[mmpass].recursion_tagged == TRUE) {
            if(DEBUG_MATCH) fprintf(stdout,"RECURSE: library[%d].recursion_tagged=TRUE, skipping\n",
                mmpass);
            continue;
        }

        reset_epotemp(&epotemp_temp,epotemp);

        elen = strlen(epotemp_temp.monomersequence);
        nfilled=0;
        for(ii=0; ii<elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }
        previous_nmarked = nfilled;
        current_nmarked = nfilled;

        while((new_unmarked_entries_filled =
            maximal_adjacent_alignment(&epotemp_temp,nwildcard,wildcards,library,
                mmpass,smallest_acceptable_piece)) >= 1) {
            current_nmarked += new_unmarked_entries_filled;
            if(DEBUG_MATCH) fprintf(stdout, "RECURSE: recursion_level=%d, "
                "mmpass=%d, previous_nmarked=%d, current_nmarked=%d\n",
                *recursion_counter,mmpass,previous_nmarked,
                current_nmarked);
        }

        elen = strlen(epotemp_temp.monomersequence);
        nfilled = 0;
        for(ii=0; ii<elen; ii++) { if(epotemp_temp.marked[ii] == TRUE) nfilled++; }

        if(nfilled >= nfilledmax) {
            output_fresh_alignment(&epotemp_temp,nboundary_cutoff);
            continue;
        }

        library[mmpass].recursion_tagged = TRUE;
        (*recursion_counter)++;

        recurse_through_the_library(nfilledmax,epotemp_temp,epothilone,nwildcard,wildcards,
            nlib,library,recursion_counter,nboundary_cutoff);

        library[mmpass].recursion_tagged = FALSE;
        (*recursion_counter)--;

    }/* mmpass */

    return 1;
}/*recurse_through_the_library*/

/* 

PURPOSE: 

INPUT:

OUTPUT:

returns the size of the largest maximal adjacent set of monomers inserted.
PROCEDURE:

*/

int maximal_adjacent_alignment(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;






    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0,hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout,"maximal_adjacent_alignment: "
        "smallest_acceptable_piece=%d\n",smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;

    /* try every start position in the target ... */
    while (*eptr != '\0') {
        sptr = library[ilib].monomersequence;
        lcount=0;
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;

        /* ... against a scan of the library entry */
        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr="";
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout, "FOUND a match: "
                        "len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece, ecount, *eptr,
                        ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN) fprintf(stdout, "FOUND a "
                            "largest piece: len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                            nlargestpiece, largestpiece_ecount,
                            *largestpiece_eptr, ilib,library[ilib].name,largestpiece_lcount,
                            *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece = 0;
                    sptr++;
                    lcount++;
                    /*NEW*/
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }

        tnlargestpiece = 0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;
    }

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",
        nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);

    /* copy the largest piece into the target alignment and mark it as filled */
    if(nlargestpiece>=smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");
        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;
        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            lcount++;
            ecount++;
        }
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment*/
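
The idea behind maximal_adjacent_alignment, namely finding the longest run of adjacent, not yet filled target monomers that also occurs contiguously in one library entry, with wildcard symbols W, X, Y and Z matching any member of a user-supplied set, can be restated as a brute-force search. The following is a minimal sketch under that reading and is not the pointer-walking logic of the listing; the function names and parameter layout are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return the wildcard set for a target symbol, or NULL if the symbol is
   an ordinary monomer. wild[0..3] hold the W, X, Y and Z sets. */
static const char *wildcard_set(char t, const char *wild[4])
{
    switch (t) {
    case 'W': return wild[0];
    case 'X': return wild[1];
    case 'Y': return wild[2];
    case 'Z': return wild[3];
    default:  return NULL;
    }
}

/* A target symbol t matches a library monomer s if they are identical or
   if t is a wildcard whose set contains s. */
static int monomer_matches(char t, char s, const char *wild[4])
{
    const char *set = wildcard_set(t, wild);

    if (t == s) return 1;
    return set != NULL && s != '\0' && strchr(set, s) != NULL;
}

/* Hypothetical restatement: find the longest run of adjacent, still
   unmarked target monomers that occurs contiguously in lib. Stores the
   start offsets of the run in *tstart and *lstart and returns its length
   (0 if nothing matches). Ties keep the first run found. */
static int longest_adjacent_run(const char *target, const int *marked,
                                const char *lib, const char *wild[4],
                                int *tstart, int *lstart)
{
    int tlen, llen, i, j, k, best = 0;

    assert(target && marked && lib && wild && tstart && lstart);
    tlen = (int)strlen(target);
    llen = (int)strlen(lib);
    *tstart = 0;
    *lstart = 0;
    for (i = 0; i < tlen; i++) {
        for (j = 0; j < llen; j++) {
            k = 0;
            while (i + k < tlen && j + k < llen && !marked[i + k] &&
                   monomer_matches(target[i + k], lib[j + k], wild)) {
                k++;
            }
            if (k > best) { best = k; *tstart = i; *lstart = j; }
        }
    }
    return best;
}
```

A caller in the style of the listing would then copy the run into the target alignment, set the corresponding marked flags, and repeat until the returned length falls below the smallest acceptable piece.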

/*

PURPOSE:

INPUT:

OUTPUT:

returns the size of the largest maximal adjacent set of monomers inserted.
PROCEDURE:

*/

int maximal_adjacent_alignment_and_dump(
    LIB *epothilone,
    int nwildcard,
    char wildcards[MAXWILD][MAXLEN],
    LIB *library,
    int ilib,
    int smallest_acceptable_piece)
{
    int ii=0,jj=0,kk=0;
    int ecount=0,lcount=0;
    int elen=0;
    int epothilonelen=0;
    int nlargestpiece=0,tnlargestpiece=0;
    int hold_this_lcount=0,hold_this_ecount=0;
    int wildcardmatch=FALSE;
    char *wptr;
    char *largestpiece_sptr,*largestpiece_eptr;
    char *hold_this_place_eptr, *hold_this_place_sptr;
    int largestpiece_lcount=0,largestpiece_ecount=0;
    char *sptr, *eptr, *lptr,*bufptr;

    if(DEBUG_WILDCARD) {
        if(nwildcard>0) {
            fprintf(stdout,"maximal_: wildcards[0]=%s\n",wildcards[0]);
        }
    }

    fprintf(stdout,"maximal_adjacent_alignment_and_dump: "
        "smallest_acceptable_piece=%d\n",smallest_acceptable_piece);

    sptr = library[ilib].monomersequence;
    eptr = epothilone->monomersequence;
    elen = strlen(epothilone->monomersequence);
    ecount=0;
    lcount=0;
    nlargestpiece=0;
    tnlargestpiece=0;
    hold_this_place_eptr = eptr;
    hold_this_ecount = ecount;

    while (*eptr != '\0') {

        sptr = library[ilib].monomersequence;
        lcount = 0;
        /*NEW*/
        hold_this_place_sptr = sptr;
        hold_this_lcount = lcount;
        wildcardmatch = FALSE;

        while(*sptr != '\0') {
            wildcardmatch = FALSE;
            if(epothilone->marked[ecount] == FALSE) {
                /* code for wildcards added MAS 05-16-00 */
                wptr = "";   /* empty set unless the target symbol is a wildcard */
                if(*eptr == 'W') { wptr = wildcards[0]; }
                else if(*eptr == 'X') { wptr = wildcards[1]; }
                else if(*eptr == 'Y') { wptr = wildcards[2]; }
                else if(*eptr == 'Z') { wptr = wildcards[3]; }

                while(*wptr != '\0') {
                    if(*wptr == *sptr) {
                        wildcardmatch = TRUE;
                        break;
                    }
                    wptr++;
                }

                if((wildcardmatch == TRUE) || (*eptr == *sptr)) {
                    tnlargestpiece++;
                    if(DEBUG_ALIGN) fprintf(stdout, "FOUND a match: "
                        "len=%d, epo(%d, %c), lib[%d].name=%s (%d, %c)\n",
                        tnlargestpiece, ecount, *eptr,
                        ilib,library[ilib].name,lcount, *sptr);
                    if(tnlargestpiece > nlargestpiece) {
                        nlargestpiece = tnlargestpiece;
                        largestpiece_sptr = hold_this_place_sptr;
                        largestpiece_lcount = hold_this_lcount;
                        largestpiece_eptr = hold_this_place_eptr;
                        largestpiece_ecount = hold_this_ecount;
                        if(DEBUG_ALIGN) fprintf(stdout, "FOUND a "
                            "largest piece: len=%d, epo(%d, %c), lib(%d, %c)\n",
                            nlargestpiece, largestpiece_ecount,
                            *largestpiece_eptr, largestpiece_lcount, *largestpiece_sptr);
                    }
                    sptr++;
                    lcount++;
                    eptr++;
                    ecount++;
                } else {
                    tnlargestpiece=0;
                    sptr++;
                    lcount++;
                    hold_this_place_sptr = sptr;
                    hold_this_lcount = lcount;
                    /*NEW*/
                    eptr = hold_this_place_eptr;
                    ecount = hold_this_ecount;
                }
            } else {
                tnlargestpiece = 0;
                break;
            }
        }

        tnlargestpiece=0;
        eptr = hold_this_place_eptr + 1;
        ecount = hold_this_ecount + 1;
        hold_this_place_eptr = eptr;
        hold_this_ecount = ecount;

        if(DEBUG_ALIGN) {
            fprintf(stdout,"incrementing hold_this_place_eptr=%c, "
                "hold_this_ecount=%d\n",*hold_this_place_eptr,hold_this_ecount);
        }
    }

    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largest piece match is %d monomers from %s\n",
        nlargestpiece,library[ilib].name);
    if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: largestpiece_ecount=%d, largestpiece_lcount=%d\n",
        largestpiece_ecount,largestpiece_lcount);

    if(nlargestpiece>=smallest_acceptable_piece) {
        if(DEBUG_ALIGN) fprintf(stdout,"ALIGN: incorporated\n");

        lcount = largestpiece_lcount;
        ecount = largestpiece_ecount;

        /* print the incorporated piece under its position in the target */
        fprintf(stdout,"ALIGN_TARGET: ");
        for(ii=0; ii<largestpiece_ecount; ii++) {
            fprintf(stdout," ");
        }

        while(ecount < (nlargestpiece + largestpiece_ecount)) {
            epothilone->alignedsequence[ecount] =
                library[ilib].monomersequence[lcount];
            strcpy(epothilone->alignedPKSname[ecount],library[ilib].name);
            strcpy(epothilone->context[ecount],library[ilib].context[lcount]);
            epothilone->marked[ecount] = TRUE;
            if(ecount < (nlargestpiece + largestpiece_ecount - 1))
                epothilone->boundarytoright[ecount] = FALSE;
            fprintf(stdout,"%c",library[ilib].monomersequence[lcount]);
            lcount++;
            ecount++;
        }

        for(ii=ecount; ii<elen; ii++) {
            fprintf(stdout," ");
        }
        fprintf(stdout,"%s\n",library[ilib].name);
    }

    return (nlargestpiece);
}/*maximal_adjacent_alignment_and_dump*/

/* 

PURPOSE: 
INPUT: 
OUTPUT: 
PROCEDURE: 

*/ 

int output_fresh_alignment(
    LIB *epotemp,
    int nboundary_cutoff)
{
    int acount=0,ecount=0;
    int longest_segmentlen=0,current_segmentlen=0;
    char *sptr,*eptr;
    char boundary[MAXNAMELEN];

    /* count the boundaries and track the longest segment */
    eptr = epotemp->monomersequence;
    ecount = 0;
    epotemp->nboundary = 0;
    longest_segmentlen = 0;
    current_segmentlen = 0;
    strcpy(boundary,epotemp->alignedPKSname[ecount]);
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            epotemp->nboundary++;
            if(current_segmentlen > longest_segmentlen) {
                longest_segmentlen = current_segmentlen;
            }
            current_segmentlen = 0;
        }
        current_segmentlen++;
        ecount++;
        eptr++;
    }
    if(current_segmentlen > longest_segmentlen) {
        longest_segmentlen = current_segmentlen;
    }

    /* suppress alignments with too many non-native interfaces */
    if(epotemp->nboundary > nboundary_cutoff) return 1;

    eptr = epotemp->monomersequence;
    ecount = 0;
    fprintf(stdout,"HIT ");
    while(*eptr != '\0') {
        if(epotemp->alignedPKSname[ecount][0] == '\0') {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:TARG(%s)| ",*eptr,epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:TARG(%s) ",*eptr,epotemp->context[ecount]);
            }
        } else {
            if(epotemp->boundarytoright[ecount] == TRUE) {
                fprintf(stdout,"%c:%4s(%s)| ",*eptr,
                    epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            } else {
                fprintf(stdout,"%c:%4s(%s) ",*eptr,
                    epotemp->alignedPKSname[ecount],epotemp->context[ecount]);
            }
        }
        ecount++;
        eptr++;
    }

    fprintf(stdout,"%d %d",epotemp->nboundary,longest_segmentlen);
    fprintf(stdout," ");

    /* summary column: the monomers with the boundaries marked */
    eptr = epotemp->monomersequence;
    acount = 0;
    ecount = 0;
    while(*eptr != '\0') {
        if(epotemp->boundarytoright[ecount] == TRUE) {
            fprintf(stdout,"%c|",epotemp->context[acount][1]);
        } else {
            fprintf(stdout,"%c",epotemp->context[acount][1]);
        }
        eptr++;
        acount++;
        ecount++;
    }

    fprintf(stdout,"\n");
    return 1;

}/*output_fresh_alignment*/
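
The bookkeeping that decides whether an alignment is printed, i.e. counting the non-native inter-modular interfaces and measuring the longest uninterrupted segment, can be separated from the printing. The following is a minimal sketch with a hypothetical function name. It counts the module that carries a boundary flag as the last member of the segment that the boundary closes; this is an assumption about the intended meaning, and the loop in the listing, as printed, appears to count that module toward the following segment instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the listing): given boundary_to_right[i]
   flags for n aligned modules, return the number of boundaries and store
   the length of the longest boundary-free segment in *longest. */
static int count_boundaries(const int *boundary_to_right, int n, int *longest)
{
    int i, nboundary = 0, current = 0;

    assert(boundary_to_right != NULL && longest != NULL);
    *longest = 0;
    for (i = 0; i < n; i++) {
        current++;                      /* module i belongs to the open segment */
        if (boundary_to_right[i]) {     /* a non-native interface follows module i */
            nboundary++;
            if (current > *longest) *longest = current;
            current = 0;
        }
    }
    if (current > *longest) *longest = current;
    return nboundary;
}
```

An alignment would then be reported only when the returned count does not exceed the boundary cutoff (5 by default, or the value given with -b).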

/* 

*/ 

int get_libraiy( 

char *libraryfile, 

LB *library, 

int mterjnodular_db_flag_ofI) 

{ 
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int ii=0jj=0Jkfc=O,lcount==0; 

int nlilH); 

char 'Tm^tr,bujIMAXBT3Fl; 

char tmonome]sequ^ce|>IAXNAMEIJ3N]; 

char *lptr,*^tn 

FILE ♦hljp; 

forOiN); kk<MAX.IJB„ENTRIES; kfcH-) { 
Kbrarypdcl-recursion Jagged = FALSE; 
foiCii=0; u<MAXNAMELEN; ii++) { 

libraiy[kk].naine[ii] = ^0*; 

libraiy[kk].monomCTsequence[ii] = ^D*; 

libTary[kk].alignedsequeace[ii] - 

foitij=0; jj<MAXLEN;ij++) { 

KbraryPdc].aligaedPKSnaine[ij][ii] = ^0'; 
hT)rary[kk].marked[jj] = FALSE; 
library[kk].context[ij][0] = ^0'; 
Hbrary[kk].context[j[j][l] = ^0•; ' 
library[kk].contextijj][2] = 
library[kk].context[ij][3] = ^0'; 

} 

} 

Kbrarypdc] Jiboundary = 0; 

} 

/* read in the library from PKS.lib */
if(NULL==(libp=fopen(libraryfile,"r"))) {
    fprintf(stdout,"TRY AGAIN; couldnt open library file: %s\n",libraryfile);
    nlib=0;
    exit(1);
}

nlib=0;

while(nlib < MAX_LIB_ENTRIES) {
if(NULL==fgets(buf,sizeof(buf),libp

if(*ba^tr = W) continvi^ 
if(*bufptr!='\n'){

/*
sscanf(bufptr, "%s %s %s",library[nlib].name,tmonomersequence,lib
*/

lptr = library[nlib].name;
while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
    *lptr++ = *bufptr++;
}
if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;



lptr = library[nlib].monomersequence;
while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
    /*
    This code specifically deletes inter-modular double bonds when the -d
    option is set
    */
    if(inter_modular_db_flag_off==TRUE){
        if(*bufptr != '='){
            *lptr++ = *bufptr++;
        } else {
            bufptr++;
        }
    }else{
        *lptr++ = *bufptr++;
    }
}
*lptr='\0';
if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

lptr = library[nlib].annotatedsequence;
while((*bufptr != ' ') && (*bufptr != '\0') && (*bufptr != '\n')){
    *lptr++ = *bufptr++;
}
*lptr='\0';
if((*bufptr != '\0') && (*bufptr != '\n')) bufptr++;

fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib].monomersequence);
fprintf(stdout,"LIBRARY(%d) %s: %s\n",nlib,library[nlib].name,library[nlib]

nlib++;

}

}

fclose(libp); 

for(kk=0; kk<nlib; kk++) {
lcount = 0;
lptr = library[kk].monomersequence;
while(*lptr != '\0'){
    if(lcount==0){
        library[kk].context[lcount][0] =
        library[kk].context[lcount][1] = *lptr;
        library[kk].context[lcount][2] = *(lptr + 1);
    }else{
        if(lcount == (strlen(library[kk].monomersequence) - 1)){
            library[kk].context[lcount][0] = *(lptr - 1);
            library[kk].context[lcount][1] = *lptr;
            library[kk].context[lcount][2] =
        } else {
            library[kk].context[lcount][0] = *(lptr - 1);
            library[kk].context[lcount][1] = *lptr;
            library[kk].context[lcount][2] = *(lptr + 1);
        }
    }
    library[kk].context[lcount][3] = '\0';
    lcount++;
}

fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,librar
fprintf(stdout,"LIBRARY(%d) %s: %s\n",kk,library[kk].name,libra
for(jj=0;jj<strlen(library[kk].monomers
fprintf(stdout,"(%s)",library[kk].cont
}

fprintf(stdout,"\n");

return nlib;
}/*get_library*/
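
The two central steps of get_library, copying a blank-delimited field while optionally dropping '=' (inter-modular double bond) characters, and recording each monomer together with its left and right neighbours, can be sketched in isolation as follows. This is a minimal sketch under stated assumptions: the helper names copy_token and build_context and the padding symbol SKETCH_PAD are hypothetical, and the symbol the listing actually stores at the two ends of the chain is not legible in the source.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAD '-'  /* hypothetical end-of-chain symbol; the listing's choice is not legible */

/* Copy one field delimited by a blank, newline, or end of string.
   When drop_db is nonzero, '=' characters are skipped instead of copied.
   Returns a pointer to the start of the next field. */
static const char *copy_token(const char *src, char *dst, size_t dstlen,
                              int drop_db)
{
    size_t n = 0;

    assert(dstlen > 0);
    while (*src != ' ' && *src != '\0' && *src != '\n') {
        if (!(drop_db && *src == '=') && n + 1 < dstlen) {
            dst[n++] = *src;
        }
        src++;
    }
    dst[n] = '\0';
    if (*src == ' ') src++;  /* step past the delimiter */
    return src;
}

/* For every monomer, store "left neighbour, monomer, right neighbour"
   as a three-character string, padding at both ends of the chain. */
static void build_context(const char *seq, char context[][4])
{
    size_t len = strlen(seq);
    size_t i;

    for (i = 0; i < len; i++) {
        context[i][0] = (i == 0) ? SKETCH_PAD : seq[i - 1];
        context[i][1] = seq[i];
        context[i][2] = (i + 1 == len) ? SKETCH_PAD : seq[i + 1];
        context[i][3] = '\0';
    }
}
```

Under these assumptions, the input line "demo AD=QJ annot" parsed with drop_db set for the second field gives the monomer sequence ADQJ and the contexts -AD, ADQ, DQJ, and QJ-.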



/*

*/

int dump_STARTER_align(

LIB epothilone,

int nwildcard,

char wildcards[MAXWILD][MAXLEN]

)

{

int elen=0;
int ecount=0,hold_ecount=0;
int wildcardmatch=FALSE;
char *sptr,*eptr,*wptr;

elen = strlen(epothilone.monomersequence);
eptr = epothilone.monomersequence;
fprintf(stdout,"ALIGN
while(*eptr != '\0'){
    fprintf(stdout,"%c",*eptr);
    eptr++;
}

fprintf(stdout,"\n");
ecount=0;

eptr = epothilone.monomersequence;
sptr = epothilone.alignedsequence;
fprintf(stdout, "ALIGN_TARGET: 
while(*eptr != '\0'){
    wildcardmatch = FALSE;
    wptr="";

    if(*eptr == 'X') { wptr = wildcards[0]; }
    else if(*eptr == 'Y') { wptr = wildcards[1]; }
    else if(*eptr == 'Z') { wptr = wildcards[2]; }

    while(*wptr != '\0'){
        if(*wptr==*sptr){
            wildcardmatch = TRUE;
            break;
        }
        wptr++;
    }

    if(wildcardmatch==TRUE) {
        fprintf(stdout,":");
    }else{
        if(*eptr==*sptr){
            fprintf(stdout,"|");
        } else {
            fprintf(stdout, " ");
        }
    }
    eptr++;
    sptr++;
}

fprintf(stdout,"\n");
fprintf(stdout, "ALIGN_TARGET: 
eptr = epothilone.monomersequence;
ecount=0;

while(*eptr != '\0'){
    if(epothilone.alignedsequence[ecount] == '\0') {
        fprintf(stdout," ");
    }else{
        fprintf(stdout,"%c",epothilone.alignedsequence[ecount]);
        hold_ecount = ecount;
    }
    eptr++;
    ecount++;
}

fprintf(stdout," %s\n",epothilone.alignedPKSname[hold_ecount]);
fprintf(stdout,"STARTER_ALIGN:\n");

}/*dump_STARTER_align*/
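
The marker line that dump_STARTER_align prints between the target and the aligned sequence can be sketched as a stand-alone function. This is an illustration under assumptions, not the listing's code: the name match_line and its parameters are hypothetical, the wildcard letters X, Y, and Z and the '|' exact-match marker are readings of partly illegible characters, and the sketch stops at the end of the shorter string, whereas the listing walks the whole target.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: write one marker per aligned position into out.
   ':' = the target holds a wildcard whose member list contains the
         aligned monomer,
   '|' = exact match,
   ' ' = mismatch. */
static void match_line(const char *target, const char *aligned,
                       const char *wild_x, const char *wild_y,
                       const char *wild_z, char *out, size_t outlen)
{
    size_t i, n = 0;

    assert(outlen > 0);
    for (i = 0; target[i] != '\0' && aligned[i] != '\0' && n + 1 < outlen; i++) {
        const char *members = "";  /* no wildcard unless selected below */

        if (target[i] == 'X') members = wild_x;
        else if (target[i] == 'Y') members = wild_y;
        else if (target[i] == 'Z') members = wild_z;

        if (strchr(members, aligned[i]) != NULL) out[n++] = ':';
        else if (target[i] == aligned[i]) out[n++] = '|';
        else out[n++] = ' ';
    }
    out[n] = '\0';
}
```

With the target AXDB, the aligned sequence ABDC, and X standing for the members B and C, the sketch yields an exact-match marker, a wildcard marker, an exact-match marker, and a blank.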

/* . 

*/ 

int reset_epotemp( 

LIB epothilone)
{

int jj=0, elen=0;

elen = strlen(epothilone.monomersequence);

strcpy(epotemp->name,epothilone.name);
strcpy(epotemp->monomersequence,epothilone.monomersequence);
strcpy(epotemp->alignedsequence,epothilone.alignedsequence);
epotemp->nboundary = 0;
for(jj=0;jj<elen;jj++){
    strcpy(epotemp->alignedPKSname[jj],epothilone.alignedPKSname[jj]);
    epotemp->marked[jj] = epothilone.marked[jj];
    strcpy(epotemp->context[jj],epothilone.context[jj]);
    epotemp->boundarytoright[jj] = epothilone.boundarytoright[jj];
}

}/*reset_epotemp*/

Thus, the present invention provides a useful means to generate new PKS genes and 
corresponding enzymes to produce polyketides. The invention having now been described by 
way of written description and examples, those of skill in the art will recognize that the 
invention can be practiced in a variety of embodiments and that the foregoing description and 
examples are for purposes of illustration and not limitation of the following claims. 

WHAT IS CLAIMED IS: 

1. A method for representing the structure of a polyketide produced by a modular 
polyketide synthase, said method comprising the steps of: 

(a) defining a set of monomer units of which said polyketide is 
composed, 

(b) assigning an alphanumeric symbol or symbols to each different 
monomer unit in said set, 

(c) identifying one or more monomers in said set that is present in said 
polyketide, and 

(d) composing a string of said symbols ordered in a manner reflecting 
the order in which said monomers occurs in said polyketide, wherein said string 
of symbols represents the structure of said polyketide. 

2. The method of claim 1, wherein said monomer set comprises two-carbon unit monomers, 
wherein a first carbon of said unit is substituted with hydrogen or methyl, and a second carbon 
of said unit is substituted with oxygen, hydroxy, or hydrogen, and said two carbon unit 
comprises either a single or a double bond between said first and second carbons. 

3. The method of claim 2, wherein said monomer set additionally comprises one or more 
members selected from the group consisting of two carbon unit monomers in which said first 
carbon is substituted with hydroxy, methoxy, or ethyl; a moiety corresponding to an amino acid 
or amino acid derivative incorporated into a PKS by a non-ribosomal peptide synthase; a moiety 
corresponding to a structure incorporated into a polyketide by an AMP ligase or a CoA ligase; 
and a moiety corresponding to a structure corresponding to a structure in a polyketide after 
modification by a polyketide modification enzyme. 

4. The method of claim 2 wherein the set of monomer unit and corresponding symbol 
comprises: 




5. The method of claim 4 wherein the set of monomer unit further comprises a 
miscellaneous monomer that is assigned the symbol Q. 

6. The method of claim 4 wherein the set of monomer unit and corresponding symbol 
further comprises 

[structural formulas not reproduced] 

7. A database of polyketides, in which each said member is represented by a string of 
alpha-numeric symbols, wherein said symbols represent structural subunits of said polyketide, 
and said string represents the order in which such subunits occur in said polyketide. 

8. The database of claim 7 that includes at least 100 different polyketides. 

9. The database of claim 7 wherein each said member is represented by a CHUCKLES 
string. 

10. The database of claim 7 wherein each said member is represented by an annotated 
CHUCKLES string. 

11. The database of claim 7 wherein the symbol and its corresponding structural subunit are 
selected from the group consisting of 




and Q for a miscellaneous monomer. 

12. The database of claim 7 wherein the symbol and its corresponding structural subunit are 
selected from the group consisting of 

[structural formulas not reproduced; the legible symbol labels include M, N, A', and B'] 

and Q for a miscellaneous monomer. 

13. A database of polyketides, in which each said member is represented by a linearized 
representation of said polyketide. 

14. A method of designing a PKS gene capable of producing a desired polyketide, which 
method comprises: 

(a) defining a string of alphanumeric symbols representing the structure of said 
polyketide, 

(b) comparing said string to a database of strings of alphanumeric symbols 
representing polyketides produced by PKS genes, 

(c) identifying common elements in said string representing the structure of said 
polyketide with elements in said strings in said database, and 

(d) generating one or more new strings from elements identified in step (b) that 
match said string representing the structure of said polyketide, wherein said new string defines a 
PKS gene capable of producing said polyketide. 

15. The method of claim 14, wherein all possible PKS genes enco 
from said database are generated and displayed. 

16. The method of claim 14, wherein said new strings generated in step (d) are rated and 
displayed in an order based on one or more parameters. 

17. The method of claim 16, wherein said parameters are selected from the group consisting 
of number of non-native module interfaces and number of non-native protein interfaces. 



WO 01/92991 

PCT/US01/17352 

[Figure 1 drawing not reproduced; legible labels: DEBS1, DEBS2, DEBS3; 6-dEB (1); R1=OH, R2=CH3: Erythromycin A (2)] 

Figure 1 

[Figure 2 drawing not reproduced; legible label: counter clockwise] 

Figure 2 

CHUCKLES: ADQJDD 

SMILES: 

Cl(=OHC®H](C)[C@@H](OH)-{C@@H](C)[C@@HI(OH)- 

[C@@H](QC-[C@@H](C)C<=0HC®HI(C)[C@@H1(C)- 

[C@@H](Q[C@@II](CQ01 



Figure 3 